diff mbox

qemu: piix: PCI bridge ACPI hotplug support

Message ID 1370882347-31129-1-git-send-email-mst@redhat.com
State New
Headers show

Commit Message

Michael S. Tsirkin June 10, 2013, 4:41 p.m. UTC
This adds support for device hotplug behind
pci bridges. Bridge devices themselves need
to be pre-configured on qemu command line.

One of the goals of this project was to
demonstrate the advantages of generating
ACPI tables in QEMU.
In my opinion, it does this successfully.

       * This saves the need to provide a new interface for firmware
         to discover bus number to pci brige mapping
       * No need for yet another interface to bios to detect qemu version
         to check if it's safe to activate new code,
         and to ship multiple table versions:
         we generated code so we know this is new qemu
       * Use of glib's GArray makes it much easier to build
         up tables in code without need for iasl and code patching

I have also tried to address the concerns that
Anthony has raised with this approach, please see below.

Design:
    - each bus gets assigned a number 0-255
    - generated ACPI code writes this number
      to a new BSEL register, then uses existing
      UP/DOWN registers to probe slot status;
      to eject, write number to BSEL register,
      then slot into existing EJ

    This is to address the ACPI spec requirement to
    avoid config cycle access to any bus except PCI roots.

Portability:
    - Non x86 (or any Linux) platforms don't need any of this code.
      They can keep happily using SHPC the way
      they always did.

Things to note:
    - this is on top of acpi patchset posted a month ago,
      with a small patch adding a core function to walk all
      pci buses, on top.
      Can also be found in my git tree
        git://git.kernel.org/pub/scm/virt/kvm/mst/qemu.git acpi

    - Extensive use of glib completely removes
      pointer math in new code: we use
      g_array_append_vals exclusively.
      There's no code patching in new code.
      This is in response to comments about security
      concerns with adding code to qemu.
      In this sense it is more secure than existing
      code in hw/acpi/core.c - that can be switched to
      glib if desired.

    - New code (not imported from seabios)
      uses glib to reduce dependency on iasl.
      With time all code can be rewritted to
      use glib and avoid iasl, if desired.

    - As was the case previously,
      systems that lack working iasl are detected at configure time,
      pre-generated hex files in source tree are used in this case.
      This addresses the concern with iasl/big-endian
      systems.

    - Compile tested only. Migration is known to be
      broken - there are a bunch of TODO tags in code.
      It was agreed on the last qemu conf meeting that
      this code is posted without looking at migration,
      with understanding that it is addressed before
      being merged. Please do not mistake this
      limitation for a fundamental one - I have a
      very good idea how to fix it, including
      cross-version migration.

    - Cross version migration: when running with -M 1.5
      and older, all ACPI table generation should be disabled.
      We'll present FW_CFG interface compatible with 1.5.

Cc: Jordan Justen <jljusten@gmail.com>
Cc: Anthony Liguori <anthony@codemonkey.ws>
Cc: Laszlo Ersek <lersek@redhat.com>
Cc: Gerd Hoffmann <kraxel@redhat.com>
Cc: seabios@seabios.org
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Kevin O'Connor <kevin@koconnor.net>
Cc: Peter Maydell <peter.maydell@linaro.org>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---
 docs/specs/acpi_pci_hotplug.txt  |   8 +
 include/hw/i386/pc.h             |   4 +-
 hw/i386/acpi-dsdt.dsl            |  36 +++-
 hw/i386/ssdt-pcihp.dsl           |  51 -----
 hw/acpi/piix4.c                  | 145 ++++++++++----
 hw/i386/acpi-build.c             | 411 ++++++++++++++++++++++++++++++++++-----
 hw/i386/Makefile.objs            |   2 +-
 hw/i386/ssdt-pcihp.hex.generated | 108 ----------
 8 files changed, 510 insertions(+), 255 deletions(-)
 delete mode 100644 hw/i386/ssdt-pcihp.dsl
 delete mode 100644 hw/i386/ssdt-pcihp.hex.generated

Comments

Anthony Liguori June 10, 2013, 6:58 p.m. UTC | #1
"Michael S. Tsirkin" <mst@redhat.com> writes:

> This adds support for device hotplug behind
> pci bridges. Bridge devices themselves need
> to be pre-configured on qemu command line.
>
> One of the goals of this project was to
> demonstrate the advantages of generating
> ACPI tables in QEMU.
> In my opinion, it does this successfully.

Since you've gone out of your way to make this difficult to actually
review...

>
>        * This saves the need to provide a new interface for firmware
>          to discover bus number to pci brige mapping

Do you mean fw_cfg?  The BIOS normally does bus numbering.  I see no
reason for it not to.

>        * No need for yet another interface to bios to detect qemu version
>          to check if it's safe to activate new code,
>          and to ship multiple table versions:

We only care about supporting one version of SeaBIOS.  SeaBIOS should
only care about supporting one version of QEMU.  We're not asking it to
support multiple versions.

>          we generated code so we know this is new qemu
>        * Use of glib's GArray makes it much easier to build
>          up tables in code without need for iasl and code patching

Adding a dynamic array to SeaBIOS isn't rocket science.

>
> I have also tried to address the concerns that
> Anthony has raised with this approach, please see below.
>
> Design:
>     - each bus gets assigned a number 0-255
>     - generated ACPI code writes this number
>       to a new BSEL register, then uses existing
>       UP/DOWN registers to probe slot status;
>       to eject, write number to BSEL register,
>       then slot into existing EJ
>
>     This is to address the ACPI spec requirement to
>     avoid config cycle access to any bus except PCI roots.
>
> Portability:
>     - Non x86 (or any Linux) platforms don't need any of this code.
>       They can keep happily using SHPC the way
>       they always did.
>
> Things to note:
>     - this is on top of acpi patchset posted a month ago,
>       with a small patch adding a core function to walk all
>       pci buses, on top.
>       Can also be found in my git tree
>         git://git.kernel.org/pub/scm/virt/kvm/mst/qemu.git acpi
>
>     - Extensive use of glib completely removes
>       pointer math in new code: we use
>       g_array_append_vals exclusively.
>       There's no code patching in new code.
>       This is in response to comments about security
>       concerns with adding code to qemu.
>       In this sense it is more secure than existing
>       code in hw/acpi/core.c - that can be switched to
>       glib if desired.
>
>     - New code (not imported from seabios)
>       uses glib to reduce dependency on iasl.
>       With time all code can be rewritted to
>       use glib and avoid iasl, if desired.
>
>     - As was the case previously,
>       systems that lack working iasl are detected at configure time,
>       pre-generated hex files in source tree are used in this case.
>       This addresses the concern with iasl/big-endian
>       systems.
>
>     - Compile tested only. Migration is known to be
>       broken - there are a bunch of TODO tags in code.
>       It was agreed on the last qemu conf meeting that
>       this code is posted without looking at migration,
>       with understanding that it is addressed before
>       being merged. Please do not mistake this
>       limitation for a fundamental one - I have a
>       very good idea how to fix it, including
>       cross-version migration.
>
>     - Cross version migration: when running with -M 1.5
>       and older, all ACPI table generation should be disabled.
>       We'll present FW_CFG interface compatible with 1.5.

So TL;DR, you break a bunch of stuff and introduce a mess of code.  It
would be less code and wouldn't break anything to add this logic to
SeaBIOS.

How is this even a discussion?

Regards,

Anthony Liguori

>
> Cc: Jordan Justen <jljusten@gmail.com>
> Cc: Anthony Liguori <anthony@codemonkey.ws>
> Cc: Laszlo Ersek <lersek@redhat.com>
> Cc: Gerd Hoffmann <kraxel@redhat.com>
> Cc: seabios@seabios.org
> Cc: David Woodhouse <dwmw2@infradead.org>
> Cc: Kevin O'Connor <kevin@koconnor.net>
> Cc: Peter Maydell <peter.maydell@linaro.org>
> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> ---
>  docs/specs/acpi_pci_hotplug.txt  |   8 +
>  include/hw/i386/pc.h             |   4 +-
>  hw/i386/acpi-dsdt.dsl            |  36 +++-
>  hw/i386/ssdt-pcihp.dsl           |  51 -----
>  hw/acpi/piix4.c                  | 145 ++++++++++----
>  hw/i386/acpi-build.c             | 411 ++++++++++++++++++++++++++++++++++-----
>  hw/i386/Makefile.objs            |   2 +-
>  hw/i386/ssdt-pcihp.hex.generated | 108 ----------
>  8 files changed, 510 insertions(+), 255 deletions(-)
>  delete mode 100644 hw/i386/ssdt-pcihp.dsl
>  delete mode 100644 hw/i386/ssdt-pcihp.hex.generated
>
> diff --git a/docs/specs/acpi_pci_hotplug.txt b/docs/specs/acpi_pci_hotplug.txt
> index a839434..951b99a 100644
> --- a/docs/specs/acpi_pci_hotplug.txt
> +++ b/docs/specs/acpi_pci_hotplug.txt
> @@ -43,3 +43,11 @@ PCI removability status (IO port 0xae0c-0xae0f, 4-byte access):
>  
>  Used by ACPI BIOS _RMV method to indicate removability status to OS. One
>  bit per slot.  Read-only
> +
> +PCI bus selector (IO port 0xae10-0xae13, 4-byte access):
> +-----------------------------------------------
> +
> +Used by ACPI BIOS methods to select the current PCI bus.
> +When written, makes all the other PCI registers (0xae00 - 0xae0f)
> +to refer to the appropriate bus.
> +0 selects PCI root bus (default).
> diff --git a/include/hw/i386/pc.h b/include/hw/i386/pc.h
> index f8d0871..66ec787 100644
> --- a/include/hw/i386/pc.h
> +++ b/include/hw/i386/pc.h
> @@ -38,7 +38,9 @@ struct PcGuestInfo {
>      bool s3_disabled;
>      bool s4_disabled;
>      uint8_t s4_val;
> -    DECLARE_BITMAP(slot_hotplug_enable, PCI_SLOT_MAX);
> +#define GUEST_INFO_MAX_HOTPLUG_BUS 256
> +    PCIBus *hotplug_buses[GUEST_INFO_MAX_HOTPLUG_BUS];
> +    DECLARE_BITMAP(hotplug_enable[GUEST_INFO_MAX_HOTPLUG_BUS], PCI_SLOT_MAX);
>      uint16_t sci_int;
>      uint8_t acpi_enable_cmd;
>      uint8_t acpi_disable_cmd;
> diff --git a/hw/i386/acpi-dsdt.dsl b/hw/i386/acpi-dsdt.dsl
> index 90efce0..dc05668 100644
> --- a/hw/i386/acpi-dsdt.dsl
> +++ b/hw/i386/acpi-dsdt.dsl
> @@ -133,11 +133,18 @@ DefinitionBlock (
>              B0EJ, 32,
>          }
>  
> +        OperationRegion(SEL, SystemIO, 0xae10, 0x04)
> +        Field(SEL, DWordAcc, NoLock, WriteAsZeros) {
> +            BSEL, 32,
> +        }
> +
>          /* Methods called by bulk generated PCI devices below */
> +        Name(BNUM, 0x00)
>  
>          /* Methods called by hotplug devices */
> -        Method(PCEJ, 1, NotSerialized) {
> +        Method(PCEJ, 2, NotSerialized) {
>              // _EJ0 method - eject callback
> +            Store(Arg1, BSEL)
>              Store(ShiftLeft(1, Arg0), B0EJ)
>              Return (0x0)
>          }
> @@ -147,16 +154,25 @@ DefinitionBlock (
>  
>          /* PCI hotplug notify method */
>          Method(PCNF, 0) {
> -            // Local0 = iterator
> +            // Local0 = bridge selector
>              Store(Zero, Local0)
> -            While (LLess(Local0, 31)) {
> -                Increment(Local0)
> -                If (And(PCIU, ShiftLeft(1, Local0))) {
> -                    PCNT(Local0, 1)
> -                }
> -                If (And(PCID, ShiftLeft(1, Local0))) {
> -                    PCNT(Local0, 3)
> -                }
> +            While (LLess(Local0, 255)) {
> +                    Store(Local0, BSEL)
> +                    // Local1 = slot iterator
> +                    Store(Zero, Local1)
> +
> +                    Store(PCIU, Local2)
> +                    Store(PCID, Local3)
> +                    While (LLess(Local1, 31)) {
> +                        Increment(Local1)
> +                        If (And(Local2, ShiftLeft(1, Local1))) {
> +                            PCNT(Add(Local1, Multiply(Local0, 32)), 1)
> +                        }
> +                        If (And(Local3, ShiftLeft(1, Local1))) {
> +                            PCNT(Add(Local1, Multiply(Local0, 32)), 3)
> +                        }
> +                    }
> +                    Increment(Local0)
>              }
>          }
>      }
> diff --git a/hw/i386/ssdt-pcihp.dsl b/hw/i386/ssdt-pcihp.dsl
> deleted file mode 100644
> index d29a5b9..0000000
> --- a/hw/i386/ssdt-pcihp.dsl
> +++ /dev/null
> @@ -1,51 +0,0 @@
> -/*
> - * This program is free software; you can redistribute it and/or modify
> - * it under the terms of the GNU General Public License as published by
> - * the Free Software Foundation; either version 2 of the License, or
> - * (at your option) any later version.
> -
> - * This program is distributed in the hope that it will be useful,
> - * but WITHOUT ANY WARRANTY; without even the implied warranty of
> - * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> - * GNU General Public License for more details.
> -
> - * You should have received a copy of the GNU General Public License along
> - * with this program; if not, see <http://www.gnu.org/licenses/>.
> - */
> -
> -ACPI_EXTRACT_ALL_CODE ssdp_pcihp_aml
> -
> -DefinitionBlock ("ssdt-pcihp.aml", "SSDT", 0x01, "BXPC", "BXSSDTPCIHP", 0x1)
> -{
> -
> -/****************************************************************
> - * PCI hotplug
> - ****************************************************************/
> -
> -    /* Objects supplied by DSDT */
> -    External(\_SB.PCI0, DeviceObj)
> -    External(\_SB.PCI0.PCEJ, MethodObj)
> -
> -    Scope(\_SB.PCI0) {
> -
> -        /* Bulk generated PCI hotplug devices */
> -        ACPI_EXTRACT_DEVICE_START ssdt_pcihp_start
> -        ACPI_EXTRACT_DEVICE_END ssdt_pcihp_end
> -        ACPI_EXTRACT_DEVICE_STRING ssdt_pcihp_name
> -
> -        // Method _EJ0 can be patched by BIOS to EJ0_
> -        // at runtime, if the slot is detected to not support hotplug.
> -        // Extract the offset of the address dword and the
> -        // _EJ0 name to allow this patching.
> -        Device(SAA) {
> -            ACPI_EXTRACT_NAME_BYTE_CONST ssdt_pcihp_id
> -            Name(_SUN, 0xAA)
> -            ACPI_EXTRACT_NAME_DWORD_CONST ssdt_pcihp_adr
> -            Name(_ADR, 0xAA0000)
> -            ACPI_EXTRACT_METHOD_STRING ssdt_pcihp_ej0
> -            Method(_EJ0, 1) {
> -                Return (PCEJ(_SUN))
> -            }
> -        }
> -    }
> -}
> diff --git a/hw/acpi/piix4.c b/hw/acpi/piix4.c
> index 19a26f5..10106fd 100644
> --- a/hw/acpi/piix4.c
> +++ b/hw/acpi/piix4.c
> @@ -29,6 +29,7 @@
>  #include "exec/ioport.h"
>  #include "hw/nvram/fw_cfg.h"
>  #include "exec/address-spaces.h"
> +#include "hw/pci/pci_bus.h"
>  
>  //#define DEBUG
>  
> @@ -42,11 +43,12 @@
>  #define GPE_LEN 4
>  
>  #define PCI_HOTPLUG_ADDR 0xae00
> -#define PCI_HOTPLUG_SIZE 0x000f
> +#define PCI_HOTPLUG_SIZE 0x0014
>  #define PCI_UP_BASE 0xae00
>  #define PCI_DOWN_BASE 0xae04
>  #define PCI_EJ_BASE 0xae08
>  #define PCI_RMV_BASE 0xae0c
> +#define PCI_SEL_BASE 0xae10
>  
>  #define PIIX4_PROC_BASE 0xaf00
>  #define PIIX4_PROC_LEN 32
> @@ -57,6 +59,8 @@
>  struct pci_status {
>      uint32_t up; /* deprecated, maintained for migration compatibility */
>      uint32_t down;
> +    uint32_t hotplug_enable;
> +    uint32_t device_present;
>  };
>  
>  typedef struct CPUStatus {
> @@ -84,9 +88,8 @@ typedef struct PIIX4PMState {
>      Notifier powerdown_notifier;
>  
>      /* for pci hotplug */
> -    struct pci_status pci0_status;
> -    uint32_t pci0_hotplug_enable;
> -    uint32_t pci0_slot_device_present;
> +    struct pci_status pci_status[GUEST_INFO_MAX_HOTPLUG_BUS];
> +    uint32_t hotplug_select;
>  
>      uint8_t disable_s3;
>      uint8_t disable_s4;
> @@ -98,6 +101,17 @@ typedef struct PIIX4PMState {
>      PcGuestInfo *guest_info;
>  } PIIX4PMState;
>  
> +static int piix4_find_hotplug_bus(PcGuestInfo *guest_info, PCIBus *bus)
> +{
> +    int i;
> +    for (i = 0; i < GUEST_INFO_MAX_HOTPLUG_BUS; ++i) {
> +        if (guest_info->hotplug_buses[i] == bus) {
> +            return i;
> +        }
> +    }
> +    return -1;
> +}
> +
>  static void piix4_acpi_system_hot_add_init(MemoryRegion *parent,
>                                             PCIBus *bus, PIIX4PMState *s);
>  
> @@ -183,6 +197,8 @@ static void pm_write_config(PCIDevice *d,
>  
>  static void vmstate_pci_status_pre_save(void *opaque)
>  {
> +#if 0
> +    /* TODO */
>      struct pci_status *pci0_status = opaque;
>      PIIX4PMState *s = container_of(pci0_status, PIIX4PMState, pci0_status);
>  
> @@ -190,6 +206,7 @@ static void vmstate_pci_status_pre_save(void *opaque)
>       * to a version that still does... of course these might get lost
>       * by an old buggy implementation, but we try. */
>      pci0_status->up = s->pci0_slot_device_present & s->pci0_hotplug_enable;
> +#endif
>  }
>  
>  static int vmstate_acpi_post_load(void *opaque, int version_id)
> @@ -267,7 +284,10 @@ static int acpi_load_old(QEMUFile *f, void *opaque, int version_id)
>          qemu_get_be16s(f, &temp);
>      }
>  
> +#if 0
> +    TODO
>      ret = vmstate_load_state(f, &vmstate_pci_status, &s->pci0_status, 1);
> +#endif
>      return ret;
>  }
>  
> @@ -293,21 +313,28 @@ static const VMStateDescription vmstate_acpi = {
>          VMSTATE_TIMER(ar.tmr.timer, PIIX4PMState),
>          VMSTATE_INT64(ar.tmr.overflow_time, PIIX4PMState),
>          VMSTATE_STRUCT(ar.gpe, PIIX4PMState, 2, vmstate_gpe, ACPIGPE),
> +#if 0
>          VMSTATE_STRUCT(pci0_status, PIIX4PMState, 2, vmstate_pci_status,
>                         struct pci_status),
> +#endif
>          VMSTATE_END_OF_LIST()
>      }
>  };
>  
> -static void acpi_piix_eject_slot(PIIX4PMState *s, unsigned slots)
> +static void acpi_piix_eject_slot(PIIX4PMState *s, unsigned bsel, unsigned slots)
>  {
>      BusChild *kid, *next;
> -    BusState *bus = qdev_get_parent_bus(&s->dev.qdev);
> +    BusState *bus;
>      int slot = ffs(slots) - 1;
>      bool slot_free = true;
>  
> +    if (!s->guest_info->hotplug_buses[bsel]) {
> +        return;
> +    }
> +    bus = &s->guest_info->hotplug_buses[bsel]->qbus;
> +
>      /* Mark request as complete */
> -    s->pci0_status.down &= ~(1U << slot);
> +    s->pci_status[bsel].down &= ~(1U << slot);
>  
>      QTAILQ_FOREACH_SAFE(kid, &bus->children, sibling, next) {
>          DeviceState *qdev = kid->child;
> @@ -322,23 +349,26 @@ static void acpi_piix_eject_slot(PIIX4PMState *s, unsigned slots)
>          }
>      }
>      if (slot_free) {
> -        s->pci0_slot_device_present &= ~(1U << slot);
> +        s->pci_status[bsel].device_present &= ~(1U << slot);
>      }
>  }
>  
> -static void piix4_update_hotplug(PIIX4PMState *s)
> +static void piix4_update_hotplug(PIIX4PMState *s, unsigned bsel)
>  {
> -    PCIDevice *dev = &s->dev;
> -    BusState *bus = qdev_get_parent_bus(&dev->qdev);
>      BusChild *kid, *next;
> +    BusState *bus;
>  
> +    if (!s->guest_info->hotplug_buses[bsel]) {
> +        return;
> +    }
> +    bus = &s->guest_info->hotplug_buses[bsel]->qbus;
>      /* Execute any pending removes during reset */
> -    while (s->pci0_status.down) {
> -        acpi_piix_eject_slot(s, s->pci0_status.down);
> +    while (s->pci_status[bsel].down) {
> +        acpi_piix_eject_slot(s, bsel, s->pci_status[bsel].down);
>      }
>  
> -    s->pci0_hotplug_enable = ~0;
> -    s->pci0_slot_device_present = 0;
> +    s->pci_status[bsel].hotplug_enable = ~0;
> +    s->pci_status[bsel].device_present = 0;
>  
>      QTAILQ_FOREACH_SAFE(kid, &bus->children, sibling, next) {
>          DeviceState *qdev = kid->child;
> @@ -347,10 +377,10 @@ static void piix4_update_hotplug(PIIX4PMState *s)
>          int slot = PCI_SLOT(pdev->devfn);
>  
>          if (pc->no_hotplug) {
> -            s->pci0_hotplug_enable &= ~(1U << slot);
> +            s->pci_status[bsel].hotplug_enable &= ~(1U << slot);
>          }
>  
> -        s->pci0_slot_device_present |= (1U << slot);
> +        s->pci_status[bsel].device_present |= (1U << slot);
>      }
>  }
>  
> @@ -358,6 +388,7 @@ static void piix4_reset(void *opaque)
>  {
>      PIIX4PMState *s = opaque;
>      uint8_t *pci_conf = s->dev.config;
> +    int i;
>  
>      pci_conf[0x58] = 0;
>      pci_conf[0x59] = 0;
> @@ -371,7 +402,9 @@ static void piix4_reset(void *opaque)
>          /* Mark SMM as already inited (until KVM supports SMM). */
>          pci_conf[0x5B] = 0x02;
>      }
> -    piix4_update_hotplug(s);
> +    for (i = 0; i < GUEST_INFO_MAX_HOTPLUG_BUS; ++i) {
> +        piix4_update_hotplug(s, i);
> +    }
>  }
>  
>  static void piix4_pm_powerdown_req(Notifier *n, void *opaque)
> @@ -382,14 +415,20 @@ static void piix4_pm_powerdown_req(Notifier *n, void *opaque)
>      acpi_pm1_evt_power_down(&s->ar);
>  }
>  
> -static void piix4_update_guest_info(PIIX4PMState *s)
> +static int piix4_device_hotplug(DeviceState *qdev, PCIDevice *dev,
> +                                PCIHotplugState state);
> +
> +static void piix4_update_bus_guest_info(PCIBus *pci_bus, void *opaque)
>  {
> -    PCIDevice *dev = &s->dev;
> -    BusState *bus = qdev_get_parent_bus(&dev->qdev);
> +    PIIX4PMState *s = opaque;
> +    PcGuestInfo *guest_info = s->guest_info;
>      BusChild *kid, *next;
> +    BusState *bus = &pci_bus->qbus;
>  
> -    memset(s->guest_info->slot_hotplug_enable, 0xff,
> -           DIV_ROUND_UP(PCI_SLOT_MAX, BITS_PER_BYTE));
> +    int i = piix4_find_hotplug_bus(guest_info, NULL);
> +    assert(i >= 0);
> +
> +    guest_info->hotplug_buses[i] = pci_bus;
>  
>      QTAILQ_FOREACH_SAFE(kid, &bus->children, sibling, next) {
>          DeviceState *qdev = kid->child;
> @@ -398,9 +437,22 @@ static void piix4_update_guest_info(PIIX4PMState *s)
>          int slot = PCI_SLOT(pdev->devfn);
>  
>          if (pc->no_hotplug) {
> -            clear_bit(slot, s->guest_info->slot_hotplug_enable);
> +            clear_bit(slot, guest_info->hotplug_enable[i]);
>          }
>      }
> +
> +    pci_bus_hotplug(pci_bus, piix4_device_hotplug, &s->dev.qdev);
> +}
> +
> +static void piix4_update_guest_info(PIIX4PMState *s)
> +{
> +    PCIDevice *dev = &s->dev;
> +
> +    memset(s->guest_info->hotplug_enable, 0xff,
> +           sizeof s->guest_info->hotplug_enable);
> +
> +    pci_for_each_bus(dev->bus, piix4_update_bus_guest_info,
> +                     s);
>  }
>  
>  static void piix4_pm_machine_ready(Notifier *n, void *opaque)
> @@ -589,16 +641,22 @@ static uint64_t pci_read(void *opaque, hwaddr addr, unsigned int size)
>  {
>      PIIX4PMState *s = opaque;
>      uint32_t val = 0;
> +    int bsel = s->hotplug_select;
> +
> +    if (bsel > GUEST_INFO_MAX_HOTPLUG_BUS) {
> +        return 0;
> +    }
>  
>      switch (addr) {
>      case PCI_UP_BASE - PCI_HOTPLUG_ADDR:
>          /* Manufacture an "up" value to cause a device check on any hotplug
>           * slot with a device.  Extra device checks are harmless. */
> -        val = s->pci0_slot_device_present & s->pci0_hotplug_enable;
> +        val = s->pci_status[bsel].device_present &
> +            s->pci_status[bsel].hotplug_enable;
>          PIIX4_DPRINTF("pci_up_read %x\n", val);
>          break;
>      case PCI_DOWN_BASE - PCI_HOTPLUG_ADDR:
> -        val = s->pci0_status.down;
> +        val = s->pci_status[bsel].down;
>          PIIX4_DPRINTF("pci_down_read %x\n", val);
>          break;
>      case PCI_EJ_BASE - PCI_HOTPLUG_ADDR:
> @@ -606,8 +664,10 @@ static uint64_t pci_read(void *opaque, hwaddr addr, unsigned int size)
>          PIIX4_DPRINTF("pci_features_read %x\n", val);
>          break;
>      case PCI_RMV_BASE - PCI_HOTPLUG_ADDR:
> -        val = s->pci0_hotplug_enable;
> +        val = s->pci_status[bsel].hotplug_enable;
>          break;
> +    case PCI_SEL_BASE - PCI_HOTPLUG_ADDR:
> +        val = s->hotplug_select;
>      default:
>          break;
>      }
> @@ -618,12 +678,18 @@ static uint64_t pci_read(void *opaque, hwaddr addr, unsigned int size)
>  static void pci_write(void *opaque, hwaddr addr, uint64_t data,
>                        unsigned int size)
>  {
> +    PIIX4PMState *s = opaque;
>      switch (addr) {
>      case PCI_EJ_BASE - PCI_HOTPLUG_ADDR:
> -        acpi_piix_eject_slot(opaque, (uint32_t)data);
> +        if (s->hotplug_select >= GUEST_INFO_MAX_HOTPLUG_BUS) {
> +            break;
> +        }
> +        acpi_piix_eject_slot(s, s->hotplug_select, data);
>          PIIX4_DPRINTF("pciej write %" HWADDR_PRIx " <== % " PRIu64 "\n",
>                        addr, data);
>          break;
> +    case PCI_SEL_BASE - PCI_HOTPLUG_ADDR:
> +        s->hotplug_select = data;
>      default:
>          break;
>      }
> @@ -706,9 +772,6 @@ static void piix4_init_cpu_status(CPUState *cpu, void *data)
>      g->sts[id / 8] |= (1 << (id % 8));
>  }
>  
> -static int piix4_device_hotplug(DeviceState *qdev, PCIDevice *dev,
> -                                PCIHotplugState state);
> -
>  static void piix4_acpi_system_hot_add_init(MemoryRegion *parent,
>                                             PCIBus *bus, PIIX4PMState *s)
>  {
> @@ -720,8 +783,6 @@ static void piix4_acpi_system_hot_add_init(MemoryRegion *parent,
>                            PCI_HOTPLUG_SIZE);
>      memory_region_add_subregion(parent, PCI_HOTPLUG_ADDR,
>                                  &s->io_pci);
> -    pci_bus_hotplug(bus, piix4_device_hotplug, &s->dev.qdev);
> -
>      qemu_for_each_cpu(piix4_init_cpu_status, &s->gpe_cpu);
>      memory_region_init_io(&s->io_cpu, &cpu_hotplug_ops, s, "apci-cpu-hotplug",
>                            PIIX4_PROC_LEN);
> @@ -730,16 +791,16 @@ static void piix4_acpi_system_hot_add_init(MemoryRegion *parent,
>      qemu_register_cpu_added_notifier(&s->cpu_added_notifier);
>  }
>  
> -static void enable_device(PIIX4PMState *s, int slot)
> +static void enable_device(PIIX4PMState *s, unsigned bsel, int slot)
>  {
>      s->ar.gpe.sts[0] |= PIIX4_PCI_HOTPLUG_STATUS;
> -    s->pci0_slot_device_present |= (1U << slot);
> +    s->pci_status[bsel].device_present |= (1U << slot);
>  }
>  
> -static void disable_device(PIIX4PMState *s, int slot)
> +static void disable_device(PIIX4PMState *s, unsigned bsel, int slot)
>  {
>      s->ar.gpe.sts[0] |= PIIX4_PCI_HOTPLUG_STATUS;
> -    s->pci0_status.down |= (1U << slot);
> +    s->pci_status[bsel].down |= (1U << slot);
>  }
>  
>  static int piix4_device_hotplug(DeviceState *qdev, PCIDevice *dev,
> @@ -748,19 +809,23 @@ static int piix4_device_hotplug(DeviceState *qdev, PCIDevice *dev,
>      int slot = PCI_SLOT(dev->devfn);
>      PIIX4PMState *s = DO_UPCAST(PIIX4PMState, dev,
>                                  PCI_DEVICE(qdev));
> +    int bsel = piix4_find_hotplug_bus(s->guest_info, dev->bus);
> +    if (bsel < 0) {
> +        return -1;
> +    }
>  
>      /* Don't send event when device is enabled during qemu machine creation:
>       * it is present on boot, no hotplug event is necessary. We do send an
>       * event when the device is disabled later. */
>      if (state == PCI_COLDPLUG_ENABLED) {
> -        s->pci0_slot_device_present |= (1U << slot);
> +        s->pci_status[bsel].device_present |= (1U << slot);
>          return 0;
>      }
>  
>      if (state == PCI_HOTPLUG_ENABLED) {
> -        enable_device(s, slot);
> +        enable_device(s, bsel, slot);
>      } else {
> -        disable_device(s, slot);
> +        disable_device(s, bsel, slot);
>      }
>  
>      pm_update_sci(s);
> diff --git a/hw/i386/acpi-build.c b/hw/i386/acpi-build.c
> index 72606a8..43e4988 100644
> --- a/hw/i386/acpi-build.c
> +++ b/hw/i386/acpi-build.c
> @@ -34,6 +34,7 @@
>  #include "hw/acpi/acpi.h"
>  #include "hw/nvram/fw_cfg.h"
>  #include "hw/i386/bios-linker-loader.h"
> +#include "hw/pci/pci_bus.h"
>  
>  #define ACPI_BUILD_APPNAME  "Bochs"
>  #define ACPI_BUILD_APPNAME6 "BOCHS "
> @@ -258,23 +259,358 @@ acpi_encode_len(uint8_t *ssdt_ptr, int length, int bytes)
>  #define ACPI_PROC_SIZEOF (*ssdt_proc_end - *ssdt_proc_start)
>  #define ACPI_PROC_AML (ssdp_proc_aml + *ssdt_proc_start)
>  
> -/* 0x5B 0x82 DeviceOp PkgLength NameString */
> -#define ACPI_PCIHP_OFFSET_HEX (*ssdt_pcihp_name - *ssdt_pcihp_start + 1)
> -#define ACPI_PCIHP_OFFSET_ID (*ssdt_pcihp_id - *ssdt_pcihp_start)
> -#define ACPI_PCIHP_OFFSET_ADR (*ssdt_pcihp_adr - *ssdt_pcihp_start)
> -#define ACPI_PCIHP_OFFSET_EJ0 (*ssdt_pcihp_ej0 - *ssdt_pcihp_start)
> -#define ACPI_PCIHP_SIZEOF (*ssdt_pcihp_end - *ssdt_pcihp_start)
> -#define ACPI_PCIHP_AML (ssdp_pcihp_aml + *ssdt_pcihp_start)
> -
>  #define ACPI_SSDT_SIGNATURE 0x54445353 /* SSDT */
>  #define ACPI_SSDT_HEADER_LENGTH 36
>  
>  #include "hw/i386/ssdt-misc.hex"
> -#include "hw/i386/ssdt-pcihp.hex"
> +
> +static inline GArray *build_alloc_array(void)
> +{
> +        return g_array_new(false, true /* clear */, 1);
> +}
> +
> +static inline void build_free_array(GArray *array)
> +{
> +        g_array_free(array, true);
> +}
> +
> +static inline void build_prepend_byte(GArray *array, uint8_t val)
> +{
> +    g_array_prepend_val(array, val);
> +}
> +
> +static inline void build_append_byte(GArray *array, uint8_t val)
> +{
> +    g_array_append_val(array, val);
> +}
> +
> +static inline void build_prepend_array(GArray *array, GArray *val)
> +{
> +    g_array_prepend_vals(array, val->data, val->len);
> +}
> +
> +static inline void build_append_array(GArray *array, GArray *val)
> +{
> +    g_array_append_vals(array, val->data, val->len);
> +}
> +
> +static void build_append_nameseg(GArray *array, const char *format, ...)
> +{
> +    GString *s = g_string_new("");
> +    va_list args;
> +
> +    va_start(args, format);
> +    g_string_vprintf(s, format, args);
> +    va_end(args);
> +
> +    assert(s->len == 4);
> +    g_array_append_vals(array, s->str, s->len);
> +    g_string_free(s, true);
> +}
> +
> +static void build_prepend_nameseg(GArray *array, const char *format, ...)
> +{
> +    GString *s = g_string_new("");
> +    va_list args;
> +
> +    va_start(args, format);
> +    g_string_vprintf(s, format, args);
> +    va_end(args);
> +
> +    assert(s->len == 4);
> +    g_array_prepend_vals(array, s->str, s->len);
> +    g_string_free(s, true);
> +}
> +
> +static void build_prepend_devfn(GArray *name, uint8_t devfn)
> +{
> +    build_prepend_nameseg(name, "S%0.02x_", devfn);
> +}
> +
> +static void build_append_devfn(GArray *name, uint8_t devfn)
> +{
> +    GArray *array = build_alloc_array();
> +    build_prepend_devfn(array, devfn);
> +    build_append_array(name, array);
> +    build_free_array(array);
> +}
> +
> +static void build_prepend_name_string(GArray *name, PCIBus *bus)
> +{
> +    int segcount;
> +
> +    for (;!pci_bus_is_root(bus); bus = bus->parent_dev->bus) {
> +        build_prepend_devfn(name, bus->parent_dev->devfn);
> +    }
> +    g_array_prepend_vals(name, "PCI0", 4);
> +    g_array_prepend_vals(name, "_SB_", 4);
> +    /* Name segment length is a multiple of 4 */
> +    assert(name->len % 4 == 0);
> +    segcount = name->len / 4;
> +    build_prepend_byte(name, segcount);
> +    build_prepend_byte(name, 0x2f); /* MultiNamePrefix */
> +    build_prepend_byte(name, '\\'); /* RootChar */
> +}
> +
> +static void build_append_name_string(GArray *name, PCIBus *bus)
> +{
> +    GArray *append = build_alloc_array();
> +    build_prepend_name_string(append, bus);
> +    build_append_array(name, append);
> +    build_free_array(append);
> +}
> +
> +/* 5.4 Definition Block Encoding */
> +enum {
> +    PACKAGE_LENGTH_1BYTE_SHIFT = 6, /* Up to 63 - use extra 2 bits. */
> +    PACKAGE_LENGTH_2BYTE_SHIFT = 4,
> +    PACKAGE_LENGTH_3BYTE_SHIFT = 12,
> +    PACKAGE_LENGTH_4BYTE_SHIFT = 20,
> +};
> +
> +static void build_prepend_package_length(GArray *package)
> +{
> +    unsigned bytes;
> +    uint8_t byte;
> +    unsigned package_len = package->len;
> +    unsigned length = package_len;
> +
> +    if (length + 1 < (1 << PACKAGE_LENGTH_1BYTE_SHIFT)) {
> +        byte = length;
> +        build_prepend_byte(package, byte);
> +        return;
> +    }
> +
> +    if (length + 4 > (1 << PACKAGE_LENGTH_4BYTE_SHIFT)) {
> +        byte = length >> PACKAGE_LENGTH_4BYTE_SHIFT;
> +        build_prepend_byte(package, byte);
> +        length &= (1 << PACKAGE_LENGTH_4BYTE_SHIFT) - 1;
> +    }
> +    if (length + 3 > (1 << PACKAGE_LENGTH_3BYTE_SHIFT)) {
> +        byte = length >> PACKAGE_LENGTH_3BYTE_SHIFT;
> +        build_prepend_byte(package, byte);
> +        length &= (1 << PACKAGE_LENGTH_3BYTE_SHIFT) - 1;
> +    }
> +    if (length + 2 > (1 << PACKAGE_LENGTH_2BYTE_SHIFT)) {
> +        byte = length >> PACKAGE_LENGTH_2BYTE_SHIFT;
> +        build_prepend_byte(package, byte);
> +        length &= (1 << PACKAGE_LENGTH_2BYTE_SHIFT) - 1;
> +    }
> +
> +    bytes = package->len - package_len;
> +    byte = (bytes << PACKAGE_LENGTH_1BYTE_SHIFT) | length;
> +    build_prepend_byte(package, byte);
> +}
> +
> +static void build_package(GArray *package, uint8_t op)
> +{
> +    build_prepend_package_length(package);
> +    build_prepend_byte(package, op);
> +}
> +
> +static void build_extop_package(GArray *package, uint8_t op)
> +{
> +    build_package(package, op);
> +    build_prepend_byte(package, 0x5B); /* ExtOpPrefix */
> +}
> +
> +static void build_append_notify(GArray *method, GArray *target_name,
> +                                uint16_t value)
> +{
> +    GArray *notify = build_alloc_array();
> +    uint8_t op = 0xA0; /* IfOp */
> +
> +    build_append_byte(notify, 0x93); /* LEqualOp */
> +    build_append_byte(notify, 0x68); /* Arg0Op */
> +    build_append_byte(notify, 0x0A); /* BytePrefix */
> +    build_append_byte(notify, value);
> +    build_append_byte(notify, value >> 8);
> +    build_append_byte(notify, 0x86); /* NotifyOp */
> +    build_append_array(notify, target_name);
> +    build_append_byte(notify, 0x69); /* Arg1Op */
> +
> +    /* Pack it up */
> +    build_package(notify, op);
> +
> +    build_append_array(method, notify);
> +
> +    build_free_array(notify);
> +}
> +
> +static void
> +build_append_notify_slots(GArray *method, PCIBus *bus,
> +                          unsigned bus_select,
> +                          DECLARE_BITMAP(hotplug_enable, PCI_SLOT_MAX))
> +{
> +    int i;
> +
> +    if (!bus) {
> +        return;
> +    }
> +
> +    for (i = 0; i < 32; i++) {
> +        GArray *target;
> +        if (!test_bit(i, hotplug_enable)) {
> +            continue;
> +        }
> +        target = build_alloc_array();
> +        build_prepend_devfn(target, i);
> +        build_prepend_name_string(target, bus);
> +        /* *32 matches Multiply in PCNF */
> +        build_append_notify(method, target, bus_select * 32 + i);
> +        build_free_array(target);
> +    }
> +}
> +
> +static void build_append_notify_buses(GArray *table, PcGuestInfo *guest_info)
> +{
> +    int i;
> +    GArray *method = build_alloc_array();
> +
> +    uint8_t op = 0x14; /* MethodOp */
> +
> +    build_append_nameseg(method, "PCNT");
> +    build_append_byte(method, 0x02); /* MethodFlags: ArgCount */
> +    for (i = 0; i < ARRAY_SIZE(guest_info->hotplug_buses); ++i) {
> +        build_append_notify_slots(method, guest_info->hotplug_buses[i], i,
> +                                  guest_info->hotplug_enable[i]);
> +    }
> +    build_package(method, op);
> +
> +    build_append_array(table, method);
> +    build_free_array(method);
> +}
> +
> +static void build_append_device(GArray *table, unsigned devfn)
> +{
> +    uint8_t op, ej0_op;
> +
> +    GArray *device = build_alloc_array();
> +    GArray *ej0 = build_alloc_array();
> +
> +    op = 0x82; /* DeviceOp */
> +
> +    build_append_devfn(device, devfn);
> +
> +    build_append_byte(device, 0x08); /* NameOp */
> +    build_append_nameseg(device, "_SUN");
> +    build_append_byte(device, 0x0A); /* BytePrefix */
> +    build_append_byte(device, PCI_SLOT(devfn));
> +    build_append_byte(device, 0x08); /* NameOp */
> +    build_append_nameseg(device, "_ADR");
> +    build_append_byte(device, 0x0C); /* DWordPrefix */
> +    build_append_byte(device, PCI_FUNC(devfn));
> +    build_append_byte(device, 0x00);
> +    build_append_byte(device, PCI_SLOT(devfn));
> +    build_append_byte(device, 0x00);
> +
> +    ej0_op = 0x14;  /* MethodOp */
> +    build_append_nameseg(ej0, "_EJ0");
> +    build_append_byte(ej0, 0x01); /* MethodFlags: ArgCount */
> +    build_append_byte(ej0, 0xA4); /* ReturnOp */
> +    build_append_nameseg(ej0, "PCEJ");
> +    build_append_nameseg(ej0, "_SUN");
> +    build_append_nameseg(ej0, "BNUM");
> +    build_package(ej0, ej0_op);
> +    build_append_array(device, ej0);
> +
> +    build_extop_package(device, op);
> +
> +    build_append_array(table, device);
> +    build_free_array(device);
> +    build_free_array(ej0);
> +}
> +
> +static bool build_find_device_bus(PCIBus *buses[GUEST_INFO_MAX_HOTPLUG_BUS],
> +                                  PCIBus *bus, unsigned devfn)
> +{
> +    int i;
> +    for (i = 0; i < ARRAY_SIZE(buses); ++i) {
> +        if (buses[i] && buses[i]->parent_dev &&
> +            buses[i]->parent_dev->bus == bus &&
> +            buses[i]->parent_dev->devfn == devfn) {
> +            return true;
> +        }
> +    }
> +    return false;
> +}
> +
> +static void
> +build_append_bus(GArray *table,  int bnum, PcGuestInfo *guest_info)
> +{
> +    uint8_t op;
> +    int slot;
> +    unsigned devfn;
> +    GArray *device;
> +    PCIBus *bus = guest_info->hotplug_buses[bnum];
> +
> +    if (!bus) {
> +        return;
> +    }
> +
> +    device = build_alloc_array();
> +
> +    /* Root device described in DSDT */
> +    if (bnum) {
> +        devfn = bus->parent_dev->devfn;
> +
> +        op = 0x82; /* DeviceOp */
> +
> +        build_append_name_string(device, bus);
> +
> +        build_append_byte(device, 0x08); /* NameOp */
> +        build_append_nameseg(device, "_SUN");
> +        build_append_byte(device, 0x0A); /* BytePrefix */
> +        build_append_byte(device, PCI_SLOT(devfn));
> +        build_append_byte(device, 0x08); /* NameOp */
> +        build_append_nameseg(device, "BNUM");
> +        build_append_byte(device, 0x0A); /* BytePrefix */
> +        build_append_byte(device, bnum);
> +        build_append_byte(device, 0x08); /* NameOp */
> +        build_append_nameseg(device, "_ADR");
> +        build_append_byte(device, 0x0C); /* DWordPrefix */
> +        build_append_byte(device, PCI_FUNC(devfn));
> +        build_append_byte(device, 0x00);
> +        build_append_byte(device, PCI_SLOT(devfn));
> +        build_append_byte(device, 0x00);
> +    } else {
> +        op = 0x10; /* ScopeOp */
> +        build_append_nameseg(device, "PCI0");
> +    }
> +
> +    for (slot = 0; slot < PCI_SLOT_MAX; ++slot) {
> +        if (!test_bit(slot, guest_info->hotplug_enable[bnum])) {
> +            continue;
> +        }
> +        /* Bus device? We'll add it later */
> +        if (build_find_device_bus(guest_info->hotplug_buses,
> +                                  bus, PCI_DEVFN(slot, 0))) {
> +            continue;
> +        }
> +        build_append_device(device, PCI_DEVFN(slot, 0));
> +    }
> +
> +    if (bnum) {
> +        build_extop_package(device, op);
> +    } else {
> +        build_package(device, op);
> +    }
> +
> +    build_append_array(table, device);
> +    build_free_array(device);
> +}
> +
> +static void build_append_device_buses(GArray *table, PcGuestInfo *guest_info)
> +{
> +    int i;
> +    for (i = 0; i < ARRAY_SIZE(guest_info->hotplug_buses); ++i) {
> +        build_append_bus(table, i, guest_info);
> +    }
> +}
>  
>  static uint8_t*
>  build_notify(uint8_t *ssdt_ptr, const char *name, int skip, int count,
> -             const char *target, int ofs)
> +             const char *target, int ofs, int step)
>  {
>      int i;
>  
> @@ -295,31 +631,14 @@ build_notify(uint8_t *ssdt_ptr, const char *name, int skip, int count,
>          *(ssdt_ptr++) = i;
>          *(ssdt_ptr++) = 0x86; /* NotifyOp */
>          memcpy(ssdt_ptr, target, 4);
> -        ssdt_ptr[ofs] = acpi_get_hex(i >> 4);
> -        ssdt_ptr[ofs + 1] = acpi_get_hex(i);
> +        ssdt_ptr[ofs] = acpi_get_hex((i * step) >> 4);
> +        ssdt_ptr[ofs + 1] = acpi_get_hex(i * step);
>          ssdt_ptr += 4;
>          *(ssdt_ptr++) = 0x69; /* Arg1Op */
>      }
>      return ssdt_ptr;
>  }
>  
> -static void patch_pcihp(int slot, uint8_t *ssdt_ptr, uint32_t eject)
> -{
> -    ssdt_ptr[ACPI_PCIHP_OFFSET_HEX] = acpi_get_hex(slot >> 4);
> -    ssdt_ptr[ACPI_PCIHP_OFFSET_HEX+1] = acpi_get_hex(slot);
> -    ssdt_ptr[ACPI_PCIHP_OFFSET_ID] = slot;
> -    ssdt_ptr[ACPI_PCIHP_OFFSET_ADR + 2] = slot;
> -
> -    /* Runtime patching of ACPI_EJ0: to disable hotplug for a slot,
> -     * replace the method name: _EJ0 by ACPI_EJ0_. */
> -    /* Sanity check */
> -    assert (!memcmp(ssdt_ptr + ACPI_PCIHP_OFFSET_EJ0, "_EJ0", 4));
> -
> -    if (!eject) {
> -        memcpy(ssdt_ptr + ACPI_PCIHP_OFFSET_EJ0, "EJ0_", 4);
> -    }
> -}
> -
>  static void
>  build_ssdt(GArray *table_data, GArray *linker,
>             FWCfgState *fw_cfg, PcGuestInfo *guest_info)
> @@ -330,11 +649,20 @@ build_ssdt(GArray *table_data, GArray *linker,
>                    + (acpi_cpus * ACPI_PROC_SIZEOF)               /* procs */
>                    + (1+2+5+(12*acpi_cpus))                  /* NTFY */
>                    + (6+2+1+(1*acpi_cpus))                   /* CPON */
> -                  + (1+3+4)                                 /* Scope(PCI0) */
> -                  + ((PCI_SLOT_MAX - 1) * ACPI_PCIHP_SIZEOF)        /* slots */
> -                  + (1+2+5+(12*(PCI_SLOT_MAX - 1))));          /* PCNT */
> -    uint8_t *ssdt = acpi_data_push(table_data, length);
> -    uint8_t *ssdt_ptr = ssdt;
> +                 );
> +    uint8_t *ssdt;
> +    uint8_t *ssdt_ptr;
> +    GArray *devices = build_alloc_array();
> +    GArray *notify = build_alloc_array();
> +
> +    /* TODO: rewrite this function using glib */
> +    build_append_device_buses(devices, guest_info);
> +    length += devices->len;
> +    build_append_notify_buses(notify, guest_info);
> +    length += notify->len;
> +
> +    ssdt = acpi_data_push(table_data, length);
> +    ssdt_ptr = ssdt;
>  
>      /* Copy header and encode fwcfg values in the S3_ / S4_ / S5_ packages */
>      memcpy(ssdt_ptr, ssdp_misc_aml, sizeof(ssdp_misc_aml));
> @@ -389,7 +717,7 @@ build_ssdt(GArray *table_data, GArray *linker,
>  
>      /* build "Method(NTFY, 2) {If (LEqual(Arg0, 0x00)) {Notify(CP00, Arg1)} ...}" */
>      /* Arg0 = Processor ID = APIC ID */
> -    ssdt_ptr = build_notify(ssdt_ptr, "NTFY", 0, acpi_cpus, "CP00", 2);
> +    ssdt_ptr = build_notify(ssdt_ptr, "NTFY", 0, acpi_cpus, "CP00", 2, 1);
>  
>      /* build "Name(CPON, Package() { One, One, ..., Zero, Zero, ... })" */
>      *(ssdt_ptr++) = 0x08; /* NameOp */
> @@ -411,15 +739,10 @@ build_ssdt(GArray *table_data, GArray *linker,
>      *(ssdt_ptr++) = 'I';
>      *(ssdt_ptr++) = '0';
>  
> -    /* build Device object for each slot */
> -    for (i = 1; i < PCI_SLOT_MAX; i++) {
> -        bool eject = test_bit(i, guest_info->slot_hotplug_enable);
> -        memcpy(ssdt_ptr, ACPI_PCIHP_AML, ACPI_PCIHP_SIZEOF);
> -        patch_pcihp(i, ssdt_ptr, eject);
> -        ssdt_ptr += ACPI_PCIHP_SIZEOF;
> -    }
> -
> -    ssdt_ptr = build_notify(ssdt_ptr, "PCNT", 1, PCI_SLOT_MAX, "S00_", 1);
> +    memcpy(ssdt_ptr, devices->data, devices->len);
> +    ssdt_ptr += devices->len;
> +    memcpy(ssdt_ptr, notify->data, notify->len);
> +    ssdt_ptr += notify->len;
>  
>      build_header(linker, table_data,
>                   (void*)ssdt, ACPI_SSDT_SIGNATURE, ssdt_ptr - ssdt, 1);
> diff --git a/hw/i386/Makefile.objs b/hw/i386/Makefile.objs
> index 2ab2572..c2e74e9 100644
> --- a/hw/i386/Makefile.objs
> +++ b/hw/i386/Makefile.objs
> @@ -6,7 +6,7 @@ obj-$(CONFIG_XEN) += xen_domainbuild.o xen_machine_pv.o
>  obj-y += kvmvapic.o
>  obj-y += acpi-build.o
>  obj-y += bios-linker-loader.o
> -hw/i386/acpi-build.o: hw/i386/acpi-build.c hw/i386/acpi-dsdt.hex hw/i386/ssdt-proc.hex hw/i386/ssdt-pcihp.hex hw/i386/ssdt-misc.hex hw/i386/q35-acpi-dsdt.hex
> +hw/i386/acpi-build.o: hw/i386/acpi-build.c hw/i386/acpi-dsdt.hex hw/i386/ssdt-proc.hex hw/i386/ssdt-misc.hex hw/i386/q35-acpi-dsdt.hex
>  hw/i386/pc_piix.o: hw/i386/pc_piix.c hw/i386/acpi-dsdt.hex
>  hw/i386/pc_q35.o: hw/i386/pc_q35.c hw/i386/q35-acpi-dsdt.hex
>  
> diff --git a/hw/i386/ssdt-pcihp.hex.generated b/hw/i386/ssdt-pcihp.hex.generated
> deleted file mode 100644
> index 0d32a27..0000000
> --- a/hw/i386/ssdt-pcihp.hex.generated
> +++ /dev/null
> @@ -1,108 +0,0 @@
> -static unsigned char ssdt_pcihp_name[] = {
> -0x33
> -};
> -static unsigned char ssdt_pcihp_adr[] = {
> -0x44
> -};
> -static unsigned char ssdt_pcihp_end[] = {
> -0x58
> -};
> -static unsigned char ssdp_pcihp_aml[] = {
> -0x53,
> -0x53,
> -0x44,
> -0x54,
> -0x58,
> -0x0,
> -0x0,
> -0x0,
> -0x1,
> -0x77,
> -0x42,
> -0x58,
> -0x50,
> -0x43,
> -0x0,
> -0x0,
> -0x42,
> -0x58,
> -0x53,
> -0x53,
> -0x44,
> -0x54,
> -0x50,
> -0x43,
> -0x1,
> -0x0,
> -0x0,
> -0x0,
> -0x49,
> -0x4e,
> -0x54,
> -0x4c,
> -0x28,
> -0x5,
> -0x10,
> -0x20,
> -0x10,
> -0x33,
> -0x5c,
> -0x2e,
> -0x5f,
> -0x53,
> -0x42,
> -0x5f,
> -0x50,
> -0x43,
> -0x49,
> -0x30,
> -0x5b,
> -0x82,
> -0x26,
> -0x53,
> -0x41,
> -0x41,
> -0x5f,
> -0x8,
> -0x5f,
> -0x53,
> -0x55,
> -0x4e,
> -0xa,
> -0xaa,
> -0x8,
> -0x5f,
> -0x41,
> -0x44,
> -0x52,
> -0xc,
> -0x0,
> -0x0,
> -0xaa,
> -0x0,
> -0x14,
> -0xf,
> -0x5f,
> -0x45,
> -0x4a,
> -0x30,
> -0x1,
> -0xa4,
> -0x50,
> -0x43,
> -0x45,
> -0x4a,
> -0x5f,
> -0x53,
> -0x55,
> -0x4e
> -};
> -static unsigned char ssdt_pcihp_start[] = {
> -0x30
> -};
> -static unsigned char ssdt_pcihp_id[] = {
> -0x3d
> -};
> -static unsigned char ssdt_pcihp_ej0[] = {
> -0x4a
> -};
> -- 
> MST
Peter Maydell June 10, 2013, 7:17 p.m. UTC | #2
On 10 June 2013 19:58, Anthony Liguori <anthony@codemonkey.ws> wrote:
> We only care about supporting one version of SeaBIOS.  SeaBIOS should
> only care about supporting one version of QEMU.  We're not asking it to
> support multiple versions.

I'm confused, how does this work in cross-version migration?
The BIOS is part of the guest, so I'd expect QEMU would
have to support whatever it happens to be. (Plus how do
you enforce the version-dependency?)

thanks
-- PMM
Laszlo Ersek June 10, 2013, 7:43 p.m. UTC | #3
On 06/10/13 20:58, Anthony Liguori wrote:
> "Michael S. Tsirkin" <mst@redhat.com> writes:
> 
>> This adds support for device hotplug behind
>> pci bridges. Bridge devices themselves need
>> to be pre-configured on qemu command line.
>>
>> One of the goals of this project was to
>> demonstrate the advantages of generating
>> ACPI tables in QEMU.
>> In my opinion, it does this successfully.
> 
> Since you've gone out of your way to make this difficult to actually
> review...

I haven't tried to review the patch yet, but why would you say such a
thing? Michael wants to convince you. There's no way he could sneak past
you in this patch. You surely won't say "it's too hard to review, I'll
yield", and he's obviously aware.

[snip]

>>        * No need for yet another interface to bios to detect qemu version
>>          to check if it's safe to activate new code,
>>          and to ship multiple table versions:
> 
> We only care about supporting one version of SeaBIOS.  SeaBIOS should
> only care about supporting one version of QEMU.  We're not asking it to
> support multiple versions.

Kevin requires cross-compat between {old, new} qemu and {old, new} SeaBIOS:

http://thread.gmane.org/gmane.comp.emulators.qemu/202005/focus=202072

>>          we generated code so we know this is new qemu
>>        * Use of glib's GArray makes it much easier to build
>>          up tables in code without need for iasl and code patching
> 
> Adding a dynamic array to SeaBIOS isn't rocket science.

Please take a look at my series for OVMF that adds basic support for
SMBIOS tables. It took me three days of basically uninterrupted coding
and two approaches to throw away until I got something submittable (with
default tables for only type 0 and type 1).

http://thread.gmane.org/gmane.comp.bios.tianocore.devel/3037

I don't want to invite accusations like "this is a strawman argument,
noone was speaking about SMBIOS", but the selective, dynamic patching is
somewhat similar between AML and SMBIOS in any given boot firmware. You
got a big bunch of named offsets that must be mangled individually.

Allowing the firmware to only install blobs verbatim, or maybe patch
them but only in a completely programmatic manner (ie. without specific
field names & offsets in the firmware, which I did try for SMBIOS in
OVMF, and failed, (*)), would be a big help.

((*) because the fw_cfg format of individual SMBIOS fields doesn't
distinguish formatted fields from strings in the unformatted area)

Not rocket science, just churn and busywork.

[snip]

> So TL;DR, you break a bunch of stuff and introduce a mess of code.  It
> would be less code and wouldn't break anything to add this logic to
> SeaBIOS.
> 
> How is this even a discussion?

Well it isn't for me any more; count me out. I'll go with the flow.

Cheers,
Laszlo
Anthony Liguori June 10, 2013, 7:50 p.m. UTC | #4
Peter Maydell <peter.maydell@linaro.org> writes:

> On 10 June 2013 19:58, Anthony Liguori <anthony@codemonkey.ws> wrote:
>> We only care about supporting one version of SeaBIOS.  SeaBIOS should
>> only care about supporting one version of QEMU.  We're not asking it to
>> support multiple versions.
>
> I'm confused, how does this work in cross-version migration?
> The BIOS is part of the guest, so I'd expect QEMU would
> have to support whatever it happens to be. (Plus how do
> you enforce the version-dependency?)

The BIOS is migrated but reset upon reboot.

Regards,

Anthony Liguori

>
> thanks
> -- PMM
Anthony Liguori June 10, 2013, 7:57 p.m. UTC | #5
Laszlo Ersek <lersek@redhat.com> writes:

> On 06/10/13 20:58, Anthony Liguori wrote:
>> "Michael S. Tsirkin" <mst@redhat.com> writes:
>> 
>>> This adds support for device hotplug behind
>>> pci bridges. Bridge devices themselves need
>>> to be pre-configured on qemu command line.
>>>
>>> One of the goals of this project was to
>>> demonstrate the advantages of generating
>>> ACPI tables in QEMU.
>>> In my opinion, it does this successfully.
>> 
>> Since you've gone out of your way to make this difficult to actually
>> review...
>
> I haven't tried to review the patch yet, but why would you say such a
> thing? Michael wants to convince you. There's no way he could sneak past
> you in this patch. You surely won't say "it's too hard to review, I'll
> yield", and he's obviously aware.

The patch is very large and depends on another large series.  It's not
split up as a logical series of changes and more importantly doesn't
solve significant problems (like live migration).

It makes it very difficult to properly evaluate the approach.  But even
taking those sort cuts, it looks like a lot of code in QEMU to avoid
what appears to be a significantly smaller amount of code in SeaBIOS.

There isn't deep dependency on information only available in QEMU.
AFAICT the only advantage of being in QEMU is being able to use a couple
data types from glib.

> [snip]
>
>>>        * No need for yet another interface to bios to detect qemu version
>>>          to check if it's safe to activate new code,
>>>          and to ship multiple table versions:
>> 
>> We only care about supporting one version of SeaBIOS.  SeaBIOS should
>> only care about supporting one version of QEMU.  We're not asking it to
>> support multiple versions.
>
> Kevin requires cross-compat between {old, new} qemu and {old, new} SeaBIOS:
>
> http://thread.gmane.org/gmane.comp.emulators.qemu/202005/focus=202072
>
>>>          we generated code so we know this is new qemu
>>>        * Use of glib's GArray makes it much easier to build
>>>          up tables in code without need for iasl and code patching
>> 
>> Adding a dynamic array to SeaBIOS isn't rocket science.
>
> Please take a look at my series for OVMF that adds basic support for
> SMBIOS tables. It took me three days of basically uninterrupted coding
> and two approaches to throw away until I got something submittable (with
> default tables for only type 0 and type 1).
>
> http://thread.gmane.org/gmane.comp.bios.tianocore.devel/3037
>
> I don't want to invite accusations like "this is a strawman argument,
> noone was speaking about SMBIOS", but the selective, dynamic patching is
> somewhat similar between AML and SMBIOS in any given boot firmware. You
> got a big bunch of named offsets that must be mangled individually.
>
> Allowing the firmware to only install blobs verbatim, or maybe patch
> them but only in a completely programmatic manner (ie. without specific
> field names & offsets in the firmware, which I did try for SMBIOS in
> OVMF, and failed, (*)), would be a big help.
>
> ((*) because the fw_cfg format of individual SMBIOS fields doesn't
> distinguish formatted fields from strings in the unformatted area)

I don't understand this piece.

Other than TianoCore being a weird environment, what made this more
difficult than it is to generate the tables in QEMU?

Regards,

Anthony Liguori

>
> Not rocket science, just churn and busywork.
>
> [snip]
>
>> So TL;DR, you break a bunch of stuff and introduce a mess of code.  It
>> would be less code and wouldn't break anything to add this logic to
>> SeaBIOS.
>> 
>> How is this even a discussion?
>
> Well it isn't for me any more; count me out. I'll go with the flow.
>
> Cheers,
> Laszlo
Laszlo Ersek June 10, 2013, 8:43 p.m. UTC | #6
On 06/10/13 21:57, Anthony Liguori wrote:
> Laszlo Ersek <lersek@redhat.com> writes:

>> Please take a look at my series for OVMF that adds basic support for
>> SMBIOS tables. It took me three days of basically uninterrupted coding
>> and two approaches to throw away until I got something submittable (with
>> default tables for only type 0 and type 1).
>>
>> http://thread.gmane.org/gmane.comp.bios.tianocore.devel/3037
>>
>> I don't want to invite accusations like "this is a strawman argument,
>> noone was speaking about SMBIOS", but the selective, dynamic patching is
>> somewhat similar between AML and SMBIOS in any given boot firmware. You
>> got a big bunch of named offsets that must be mangled individually.
>>
>> Allowing the firmware to only install blobs verbatim, or maybe patch
>> them but only in a completely programmatic manner (ie. without specific
>> field names & offsets in the firmware, which I did try for SMBIOS in
>> OVMF, and failed, (*)), would be a big help.
>>
>> ((*) because the fw_cfg format of individual SMBIOS fields doesn't
>> distinguish formatted fields from strings in the unformatted area)
> 
> I don't understand this piece.
> 
> Other than TianoCore being a weird environment, what made this more
> difficult than it is to generate the tables in QEMU?

QEMU can pass down two kinds of SMBIOS config items over fw_cfg:
verbatim tables and individual fields.

The verbatim table kind means "install this table as-is". I'm not sure
how SeaBIOS solves it, in OVMF you call Smbios->Add().

Regarding the "field kind" config item: for each table type required by
the SMBIOS spec, if qemu doesn't provide such a table in whole / in
verbatim, then the firmware must install a default table for that type,
*and* patch it with any referring field kind config items.

Given the current format of the "field kind" config item, it is
currently not possible to just iterate over the "field kind" config
items, and patch whatever field in whatever table they refer to. The
information contained in the "field kind" config item is insufficient to
do this.

The "field kind" config item designates a (table, field offset) pair. If
that offset points at a string-typed SMBIOS field (basically string
serial number into the unformatted area), one must update the
unformatted string area after the actual table fields. (In OVMF one does
this with a different call, Smbios->UpdateStrings(), and at that point
the table even must have been installed already.)

If the offset points at a non-string-typed (= in-formatted-area) field,
the field must be overwritten in-place. (In OVMF this must happen
*before* table installation, ie. before Smbios->Add().)

So, the current fw_cfg representation is insufficient, and the firmware
must extend (or qemu should have extended) the info contents with this
distinction between formatted and unformatted. Again this differs from
field to field and the origin of that classification is the SMBIOS spec.

Until this point in my email there has been no clear indication whether
qemu or the firmware should do this. *Some* side will have to suffer
shoveling the field/offset info, and a table template for each type,
from the SMBIOS spec into code. I described the above solely as
background, to set the landscape.

(a) The first thing that tips the scale is: of qemu there is one, and of
firmware, there exist at least two, *wildly different* implementations.
(Apologies, I forgot about coreboot; not sure if it's affected.) It's
not a design purity argument, just what's cheaper.

(b) The second thing that further unbalances the scale: suppose you
decide to add further field-kind overrides to qemu. For example,
currently Type 3 (chassis information, table required by SMBIOS spec)
cannot be overridden field-wise.

If a user is unhappy with her firmware's default Type 3 table, she's
allowed to hex-edit a Type 3 blob, and pass it with "-smbios
file=type3.blob" on the qemu command line. If you as developer want to
provide her with more flexibility, you need to
(i) patch qemu (of course), to recognize individual fields on the
command line, and to populate fw_cfg with them,
(ii) patch *every single firmware* supported by qemu to *look for* those
(type=3,offset) field-kind config entries, and to effectuate them.

If you pass down only verbatim tables, then (ii) falls away. You don't
have to update SeaBIOS, nor OVMF. As a bonus, you don't have to care
about versions: age-old firmware that only knows how to install verbatim
tables will immediately benefit from qemu's new-fangled Type 3 flexibility.

Case in point: <https://bugzilla.redhat.com/show_bug.cgi?id=788142>.
(Non-public bug, sorry about that. I think I can disclose that it is
about supporting the "field kind" config entry for Type 3 tables.)


For ACPI tables, especially for tables containing AML, multiply the cost
of (ii) by twenty. Many more offsets to patch and vastly more
dependencies via fw_cfg.

Laszlo
Laszlo Ersek June 10, 2013, 9:14 p.m. UTC | #7
Sorry about replying to myself...

On 06/10/13 22:43, Laszlo Ersek wrote:

> It's not a design purity argument, just what's cheaper.

You probably consider SMBIOS/ACPI table generation "guest code" that has
no place in the machine emulator. You're likely willing to expose the
needed bits on a case-by-case basis, in their barest representation.

I must consider the entire stack, from the libvirt top, down to the
guest kernel or root shell, and see where it is cheapest to effect
changes. The higher I can hoist something the better.

Modulo, it depends on what element of the stack people think of as
central / indispensable -- that element should support the feature directly.

(Returning to my earlier example, it would be theoretically viable *I
guess* to equip libvirt to pre-format a Type 3 SMBIOS blob in a file,
and pass it with "-smbios file=...", but many people don't use libvirt.
It also might not apply to other emulators supported by libvirt.)

Thanks,
Laszlo
Anthony Liguori June 10, 2013, 9:45 p.m. UTC | #8
Laszlo Ersek <lersek@redhat.com> writes:

> On 06/10/13 21:57, Anthony Liguori wrote:
>> Laszlo Ersek <lersek@redhat.com> writes:
>
>> 
>> I don't understand this piece.
>> 
>> Other than TianoCore being a weird environment, what made this more
>> difficult than it is to generate the tables in QEMU?
>
> QEMU can pass down two kinds of SMBIOS config items over fw_cfg:
> verbatim tables and individual fields.
>
> The verbatim table kind means "install this table as-is". I'm not sure
> how SeaBIOS solves it, in OVMF you call Smbios->Add().
>
> Regarding the "field kind" config item: for each table type required by
> the SMBIOS spec, if qemu doesn't provide such a table in whole / in
> verbatim, then the firmware must install a default table for that type,
> *and* patch it with any referring field kind config items.
>
> Given the current format of the "field kind" config item, it is
> currently not possible to just iterate over the "field kind" config
> items, and patch whatever field in whatever table they refer to. The
> information contained in the "field kind" config item is insufficient to
> do this.
>
> The "field kind" config item designates a (table, field offset) pair. If
> that offset points at a string-typed SMBIOS field (basically string
> serial number into the unformatted area), one must update the
> unformatted string area after the actual table fields. (In OVMF one does
> this with a different call, Smbios->UpdateStrings(), and at that point
> the table even must have been installed already.)
>
> If the offset points at a non-string-typed (= in-formatted-area) field,
> the field must be overwritten in-place. (In OVMF this must happen
> *before* table installation, ie. before Smbios->Add().)

Okay, I understand the problem here.  Thanks for the explanation.  I had
assumed that the smbios_field structure was the binary representation of
the table.  I didn't realize it was our own format used to pass
information over.

> So, the current fw_cfg representation is insufficient, and the firmware
> must extend (or qemu should have extended) the info contents with this
> distinction between formatted and unformatted. Again this differs from
> field to field and the origin of that classification is the SMBIOS spec.
>
> Until this point in my email there has been no clear indication whether
> qemu or the firmware should do this. *Some* side will have to suffer
> shoveling the field/offset info, and a table template for each type,
> from the SMBIOS spec into code. I described the above solely as
> background, to set the landscape.
>
> (a) The first thing that tips the scale is: of qemu there is one, and of
> firmware, there exist at least two, *wildly different* implementations.
> (Apologies, I forgot about coreboot; not sure if it's affected.) It's
> not a design purity argument, just what's cheaper.

The only reason there are two implementations is a licensing problem
that must be solved anyway.

> (b) The second thing that further unbalances the scale: suppose you
> decide to add further field-kind overrides to qemu. For example,
> currently Type 3 (chassis information, table required by SMBIOS spec)
> cannot be overridden field-wise.

The reason we expose blobs in SMBIOS is that blobs can be vendor
specific and there was a strong desire to allow vendor specific blobs to
be passed through.  There really is no choice but to expose a blob
interface for that.

> If a user is unhappy with her firmware's default Type 3 table, she's
> allowed to hex-edit a Type 3 blob, and pass it with "-smbios
> file=type3.blob" on the qemu command line. If you as developer want to
> provide her with more flexibility, you need to
> (i) patch qemu (of course), to recognize individual fields on the
> command line, and to populate fw_cfg with them,
> (ii) patch *every single firmware* supported by qemu to *look for* those
> (type=3,offset) field-kind config entries, and to effectuate them.
>
> If you pass down only verbatim tables, then (ii) falls away. You don't
> have to update SeaBIOS, nor OVMF.

OVMF is proprietary.  It is not "supported" by QEMU.  Firmware is a
fundamental part of our stack.  I'm not really convinced that
QEMU<->firmware is a GPL boundary because of how tightly the two are
linked.

Moving large chunks of firmware code into QEMU just to avoid solving
licensing issues is a non-starter with me.

> As a bonus, you don't have to care
> about versions: age-old firmware that only knows how to install verbatim
> tables will immediately benefit from qemu's new-fangled Type 3
> flexibility.

This presume that (1) you only care about exposing this information to
x86 guests (2) that the firmware will never have any need to patch these
blobs.

> Case in point: <https://bugzilla.redhat.com/show_bug.cgi?id=788142>.
> (Non-public bug, sorry about that. I think I can disclose that it is
> about supporting the "field kind" config entry for Type 3 tables.)

This information is *exactly* the kind of information that you want to
be exposed across architectures.  It makes lots of sense to model a
device to expose this type of information in a structure way.

This discussion comes down to two things I think: (a) our existing
firmware interface is pretty poor (b) we are duplicating work because of
firmware licensing.

We can fix (a) and there's lots of value in doing that in terms of
improving support for other architectures.  We've discussed (b) in other
threads and I've stated my opinion on the direction we need to take.

Regards,

Anthony Liguori

>
>
> For ACPI tables, especially for tables containing AML, multiply the cost
> of (ii) by twenty. Many more offsets to patch and vastly more
> dependencies via fw_cfg.
>
> Laszlo
Kevin O'Connor June 10, 2013, 11:05 p.m. UTC | #9
On Mon, Jun 10, 2013 at 04:45:53PM -0500, Anthony Liguori wrote:
> This discussion comes down to two things I think: (a) our existing
> firmware interface is pretty poor (b) we are duplicating work because of
> firmware licensing.
> 
> We can fix (a) and there's lots of value in doing that in terms of
> improving support for other architectures.  We've discussed (b) in other
> threads and I've stated my opinion on the direction we need to take.

I'm not concerned about (b).

I'm quite curious how you are planning to solve (a).  I think it would
help move this conversation forward if you could take a couple acpi
tables in use today (eg, madt, srat) and describe the future format
and location for each field in those tables.  I think it would also be
useful if you could do the same for a couple DSDT entries (eg,
\_SB.PCI0, \_SB.PCI0.ISA) and also describe how you plan to have the
guest build the AML from that info.

-Kevin
Anthony Liguori June 10, 2013, 11:34 p.m. UTC | #10
Kevin O'Connor <kevin@koconnor.net> writes:

> On Mon, Jun 10, 2013 at 04:45:53PM -0500, Anthony Liguori wrote:
>> This discussion comes down to two things I think: (a) our existing
>> firmware interface is pretty poor (b) we are duplicating work because of
>> firmware licensing.
>> 
>> We can fix (a) and there's lots of value in doing that in terms of
>> improving support for other architectures.  We've discussed (b) in other
>> threads and I've stated my opinion on the direction we need to take.
>
> I'm not concerned about (b).
>
> I'm quite curious how you are planning to solve (a).  I think it would
> help move this conversation forward if you could take a couple acpi
> tables in use today (eg, madt, srat) and describe the future format
> and location for each field in those tables.

Sure.  Today we expose the device model through a graph that can be
losely mapped to a filesystem.  We expose two methods of interest as
part of our management api:

{ 'type': 'ObjectPropertyInfo',
  'data': { 'name': 'str', 'type': 'str' } }

{ 'command': 'qom-list',
  'data': { 'path': 'str' },
  'returns': [ 'ObjectPropertyInfo' ] }

{ 'command': 'qom-get',
  'data': { 'path': 'str', 'property': 'str' },
  'returns': 'visitor',
  'gen': 'no' }

I am proposing that we replace fw_cfg with a device that exports the QOM
tree (read-only) via firmware.

We have a utility called qom-fuse that can mount a fuse filesystem that
exports this as a proper filesystem too.

Here's a small example:

[06:26 PM] anthony🐵 titi:~/git/qemu/QMP/tmp$ find
.
./machine
./machine/i440fx
./machine/i440fx/ioapic
./machine/i440fx/ioapic/parent_bus
./machine/i440fx/ioapic/realized
./machine/i440fx/ioapic/type
./machine/i440fx/pci.0
./machine/i440fx/pci.0/child[5]
./machine/i440fx/pci.0/child[4]
./machine/i440fx/pci.0/child[3]
./machine/i440fx/pci.0/child[2]
./machine/i440fx/pci.0/child[1]
./machine/i440fx/pci.0/child[0]
...

For the MADT table, right now SeaBIOS needs the CPU count.  That can be
found by counting the number of CPU nodes.  Today cpus are unattached so
they appear in /machine/unattached but we should put them in a
/machine/cpu container for clarity.

Presumably it would be good to expose the apic id too.  This is already
in the QOM tree.

QOM is the full representation of the device state so we should not ever
need to add additional things to fw_cfg.  More likely than not, when
SeaBIOS needs more information, it's already there so added
functionality will probably Just Work with older QEMUs.

> I think it would also be
> useful if you could do the same for a couple DSDT entries (eg,
> \_SB.PCI0, \_SB.PCI0.ISA) and also describe how you plan to have the
> guest build the AML from that info.

Likewise the slot count should be available too.  We hard code slots
today but it is something we should model properly.  We would model it
using QOM of course.

Internally within QEMU, this initial discussion started by saying that
any ACPI generation within QEMU should happen strictly with QOM
methods.  This was the crux of my argument, if QOM is the only interface
we need, everything is there for the firmware to do the same job that
QEMU would be doing.

Regards,

Anthony Liguori

>
> -Kevin
David Woodhouse June 10, 2013, 11:52 p.m. UTC | #11
On Mon, 2013-06-10 at 18:34 -0500, Anthony Liguori wrote:
> Internally within QEMU, this initial discussion started by saying that
> any ACPI generation within QEMU should happen strictly with QOM
> methods.  This was the crux of my argument, if QOM is the only
> interface we need, everything is there for the firmware to do the same
> job that QEMU would be doing.

That's nice in theory, but I'm not sure how it works as things evolve
and new things / new features get exposed. The firmware's
*interpretation* of the QOM tree needs to be kept in sync with qemu.

Hm, make that: The firmwares' *interpretation*…

Let's take a specific, recent example. We fixed the PIIX4 code to
actually handle the hard reset on port 0xcf9. We need to fix the ACPI
tables to indicate a usable RESET_REG.

How is that exposed in the QOM tree, and how does it all work? With qemu
exposing ACPI tables in their close-to-final form, it's just fine. Boot
with a recent qemu and it's all nice and shiny; boot with an old qemu
and it doesn't reset properly.

But if the firmware has to be updated to interpret the new feature
advertised in the QOM tree and translate it into the ACPI table, then we
haven't really got much of an improvement.

Please explain how this is supposed to work in *practice*.
Anthony Liguori June 11, 2013, 12:11 a.m. UTC | #12
David Woodhouse <dwmw2@infradead.org> writes:

> On Mon, 2013-06-10 at 18:34 -0500, Anthony Liguori wrote:
>> Internally within QEMU, this initial discussion started by saying that
>> any ACPI generation within QEMU should happen strictly with QOM
>> methods.  This was the crux of my argument, if QOM is the only
>> interface we need, everything is there for the firmware to do the same
>> job that QEMU would be doing.
>
> That's nice in theory, but I'm not sure how it works as things evolve
> and new things / new features get exposed. The firmware's
> *interpretation* of the QOM tree needs to be kept in sync with qemu.
>
> Hm, make that: The firmwares' *interpretation*…
>
> Let's take a specific, recent example. We fixed the PIIX4 code to
> actually handle the hard reset on port 0xcf9. We need to fix the ACPI
> tables to indicate a usable RESET_REG.

Very good example.

Normally, we try to be bug-for-bug compatible for guests such that -M
pc-1.4 would behave exactly the same.

In this case, we failed to introduce a property to control this behavior
like we should have.  If this changes the guest ACPI tables, it
definitely should definitely be set based on a property.

This is a good example of why this approach is good for QEMU, it helps
us catch stuff like this.

> How is that exposed in the QOM tree, and how does it all work? With qemu
> exposing ACPI tables in their close-to-final form, it's just fine. Boot
> with a recent qemu and it's all nice and shiny; boot with an old qemu
> and it doesn't reset properly.

If we did this right in QEMU, we'd have to introduce a QOM property
anyway as that's how we trigger differences in machine behavior.  And -M
pc-1.4 ought to generate the same tables as qemu 1.4 did.

Regards,

Anthony Liguori

> But if the firmware has to be updated to interpret the new feature
> advertised in the QOM tree and translate it into the ACPI table, then we
> haven't really got much of an improvement.
>
> Please explain how this is supposed to work in *practice*.
>
> -- 
> dwmw2
Kevin O'Connor June 11, 2013, 12:23 a.m. UTC | #13
On Mon, Jun 10, 2013 at 06:34:29PM -0500, Anthony Liguori wrote:
> Kevin O'Connor <kevin@koconnor.net> writes:
> 
> > On Mon, Jun 10, 2013 at 04:45:53PM -0500, Anthony Liguori wrote:
> >> This discussion comes down to two things I think: (a) our existing
> >> firmware interface is pretty poor (b) we are duplicating work because of
> >> firmware licensing.
> >> 
> >> We can fix (a) and there's lots of value in doing that in terms of
> >> improving support for other architectures.  We've discussed (b) in other
> >> threads and I've stated my opinion on the direction we need to take.
> >
> > I'm not concerned about (b).
> >
> > I'm quite curious how you are planning to solve (a).  I think it would
> > help move this conversation forward if you could take a couple acpi
> > tables in use today (eg, madt, srat) and describe the future format
> > and location for each field in those tables.
> 
> Sure.  Today we expose the device model through a graph that can be
> losely mapped to a filesystem.  We expose two methods of interest as
> part of our management api:
[...]
> For the MADT table, right now SeaBIOS needs the CPU count.  That can be
> found by counting the number of CPU nodes.  Today cpus are unattached so
> they appear in /machine/unattached but we should put them in a
> /machine/cpu container for clarity.

SeaBIOS needs much more than the CPU count.  The madt contains info on
each interrupt - its global irq number, the legacy irq number, the
acpi defined type of the irq, and the acpi defined flags for the irq.
It also contains similar info on each cpu (including apic id), the io
apic, and NMIs.

> QOM is the full representation of the device state so we should not ever
> need to add additional things to fw_cfg.  More likely than not, when
> SeaBIOS needs more information, it's already there so added
> functionality will probably Just Work with older QEMUs.
> 
> > I think it would also be
> > useful if you could do the same for a couple DSDT entries (eg,
> > \_SB.PCI0, \_SB.PCI0.ISA) and also describe how you plan to have the
> > guest build the AML from that info.
> 
> Likewise the slot count should be available too.  We hard code slots
> today but it is something we should model properly.  We would model it
> using QOM of course.
> 
> Internally within QEMU, this initial discussion started by saying that
> any ACPI generation within QEMU should happen strictly with QOM
> methods.  This was the crux of my argument, if QOM is the only interface
> we need, everything is there for the firmware to do the same job that
> QEMU would be doing.

I think we keep talking past each other.  You seem to be pointing out
that the information seabios uses to patch its hardcoded tables can be
passed in via QOM.  Agreed, it can.  I'm pointing out all the info
that is hardcoded in seabios - I don't see how that can be passed via
QOM.

All the hardcoded data in seabios is a problem, because when it
changes (and it frequently does) it requires changes to both QEMU and
SeaBIOS and those changes have to be coordinated.  The key reason for
moving the tables into QEMU is not that it can better patch the tables
- the advantage is that it can hardcode the tables that need patching.

I can cite several recent examples of seabios change requests that
require changing of the hardcoded tables: liguang wants to add an
"embedded controller" hardware device which requires changes to
seabios' dsdt, Corey Minyard wants to add IPMI hardware support (both
smbios changes and DSDT changes), and Corey Bryant wants to add TPM
hardware support.

How do we solve the problem of seabios having a ton of hardcoded
information about the qemu hardware, and seabios having to change with
the hardware modifications that qemu makes?

-Kevin
Jordan Justen June 11, 2013, 12:28 a.m. UTC | #14
On Mon, Jun 10, 2013 at 2:45 PM, Anthony Liguori <anthony@codemonkey.ws> wrote:
> OVMF is proprietary.

I don't agree that not-OSI means proprietary.

I agree that the FAT driver is not 'free software' and I agree that is
a problem for usage with free software projects, such as QEMU. This is
a big deal, but unfortunately, as an Intel employee, I think I've done
as much as I can to address this.

It couldn't hurt if more people that actually care about this spoke up
on edk2-devel about the issue, or perhaps within a UEFI working group.
Because, I know that they've stopped listening to me about it.

> It is not "supported" by QEMU.

No, but I've always thought that QEMU was happy to have alternative
firmware projects.

> I'm not really convinced that
> QEMU<->firmware is a GPL boundary because of how tightly the two are
> linked.

Where has 'linked' in terms of the GPL ever been anything other than
actual executable linking?

> Moving large chunks of firmware code into QEMU just to avoid solving
> licensing issues is a non-starter with me.

Is this a licensing issue? I thought this was a "let's save time by
doing it in one place" thing. I'm pretty ambivalent about this
feature, really. I don't think it is even worth all this bickering.

I'm certain OVMF has ACPI issues on QEMU, but I don't think it is a
huge deal to resolve them independently of this feature.

I was not a huge fan of supporting this type of thing for Xen in OVMF,
but it does seem to work fine.

-Jordan
Anthony Liguori June 11, 2013, 12:51 a.m. UTC | #15
Hi Kevin,

On Mon, Jun 10, 2013 at 7:23 PM, Kevin O'Connor <kevin@koconnor.net> wrote:
> On Mon, Jun 10, 2013 at 06:34:29PM -0500, Anthony Liguori wrote:
>> Kevin O'Connor <kevin@koconnor.net> writes:
>>
>> For the MADT table, right now SeaBIOS needs the CPU count.  That can be
>> found by counting the number of CPU nodes.  Today cpus are unattached so
>> they appear in /machine/unattached but we should put them in a
>> /machine/cpu container for clarity.
>
> SeaBIOS needs much more than the CPU count.  The madt contains info on
> each interrupt - its global irq number, the legacy irq number, the
> acpi defined type of the irq, and the acpi defined flags for the irq.
> It also contains similar info on each cpu (including apic id), the io
> apic, and NMIs.

See below.

>> QOM is the full representation of the device state so we should not ever
>> need to add additional things to fw_cfg.  More likely than not, when
>> SeaBIOS needs more information, it's already there so added
>> functionality will probably Just Work with older QEMUs.
>>
>> > I think it would also be
>> > useful if you could do the same for a couple DSDT entries (eg,
>> > \_SB.PCI0, \_SB.PCI0.ISA) and also describe how you plan to have the
>> > guest build the AML from that info.
>>
>> Likewise the slot count should be available too.  We hard code slots
>> today but it is something we should model properly.  We would model it
>> using QOM of course.
>>
>> Internally within QEMU, this initial discussion started by saying that
>> any ACPI generation within QEMU should happen strictly with QOM
>> methods.  This was the crux of my argument, if QOM is the only interface
>> we need, everything is there for the firmware to do the same job that
>> QEMU would be doing.
>
> I think we keep talking past each other.  You seem to be pointing out
> that the information seabios uses to patch its hardcoded tables can be
> passed in via QOM.  Agreed, it can.  I'm pointing out all the info
> that is hardcoded in seabios - I don't see how that can be passed via
> QOM.

No, we aren't talking past each other.  We both have the same goal.

A lot of what is hard coded in SeaBIOS is hard coded in QEMU too.  IRQ
routing is a good example of that.  We want to make this information
dynamic.

Part of the trouble is that we haven't had a need to not hard code it
because this is only information consumed by the BIOS.

> All the hardcoded data in seabios is a problem, because when it
> changes (and it frequently does) it requires changes to both QEMU and
> SeaBIOS and those changes have to be coordinated.  The key reason for
> moving the tables into QEMU is not that it can better patch the tables
> - the advantage is that it can hardcode the tables that need patching.

So let's fix it.  It's very easy to add read-only properties to QOM so
we can hard code the bits that are in SeaBIOS now as read-only
properties.  For the MADT table, the per-CPU and IOAPIC info can
simply be properties of those devices.  We already represent irqs in
QOM as integers so I suspect much of the information SeaBIOS needs is
already there.

> I can cite several recent examples of seabios change requests that
> require changing of the hardcoded tables: liguang wants to add an
> "embedded controller" hardware device which requires changes to
> seabios' dsdt, Corey Minyard wants to add IPMI hardware support (both
> smbios changes and DSDT changes), and Corey Bryant wants to add TPM
> hardware support.
>
> How do we solve the problem of seabios having a ton of hardcoded
> information about the qemu hardware, and seabios having to change with
> the hardware modifications that qemu makes?

I think that we can pretty much touch a table once pulling all of the
info from QOM and then from a SeaBIOS point of view, never have to
touch it again.

The benefit is that solving this problem for x86 solves it for other
architectures too and also lays the ground work to let a user actually
control these bits too.

Regards,

Anthony Liguori

> -Kevin
Anthony Liguori June 11, 2013, 1:03 a.m. UTC | #16
Hi Jordan,

On Mon, Jun 10, 2013 at 7:28 PM, Jordan Justen <jljusten@gmail.com> wrote:
> On Mon, Jun 10, 2013 at 2:45 PM, Anthony Liguori <anthony@codemonkey.ws> wrote:
>> OVMF is proprietary.
>
> I don't agree that not-OSI means proprietary.

I call it blue if that makes you all feel better :-)  But no matter
what the word we use it, the point still stands, we need a UEFI
firmware that has a OSI-approved license.  Forgetting everything else,
the distros can't ship OVMF in its current form.

> I agree that the FAT driver is not 'free software' and I agree that is
> a problem for usage with free software projects, such as QEMU. This is
> a big deal, but unfortunately, as an Intel employee, I think I've done
> as much as I can to address this.
>
> It couldn't hurt if more people that actually care about this spoke up
> on edk2-devel about the issue, or perhaps within a UEFI working group.
> Because, I know that they've stopped listening to me about it.

Is this useful?  I can try to make noise.  I assume since folks like
you who have much more credibility and familiarity in this space have
given up, it's a lost cause.

>> It is not "supported" by QEMU.
>
> No, but I've always thought that QEMU was happy to have alternative
> firmware projects.

We are and we're happy to accept patches to enable things even if its
proprietary.  But that's all assuming we're improving hardware
emulation.

What we're talking about here is doing something that's very un-hardware like.

>> I'm not really convinced that
>> QEMU<->firmware is a GPL boundary because of how tightly the two are
>> linked.
>
> Where has 'linked' in terms of the GPL ever been anything other than
> actual executable linking?

I should not have even brought this up as it's not worth debating.  If
you're curious,
http://www.gnu.org/licenses/gpl-faq.html#MereAggregation

>> Moving large chunks of firmware code into QEMU just to avoid solving
>> licensing issues is a non-starter with me.
>
> Is this a licensing issue? I thought this was a "let's save time by
> doing it in one place" thing. I'm pretty ambivalent about this
> feature, really. I don't think it is even worth all this bickering.
>
> I'm certain OVMF has ACPI issues on QEMU, but I don't think it is a
> huge deal to resolve them independently of this feature.

Doing it one place could just mean share code.  Code can't be shared
today, that's why licensing has come up.

Regards,

Anthony Liguori

> I was not a huge fan of supporting this type of thing for Xen in OVMF,
> but it does seem to work fine.
>
> -Jordan
Kevin O'Connor June 11, 2013, 1:19 a.m. UTC | #17
On Mon, Jun 10, 2013 at 07:51:55PM -0500, Anthony Liguori wrote:
> I think that we can pretty much touch a table once pulling all of the
> info from QOM and then from a SeaBIOS point of view, never have to
> touch it again.

Thanks.  I do think it would help if you could go through the details
of a couple of tables and a couple of DSDT entries and walk through
where each field will come from.  And also describe how seabios will
build the resulting AML.  From my previous email:

> I'm quite curious how you are planning to solve (a).  I think it would
> help move this conversation forward if you could take a couple acpi
> tables in use today (eg, madt, srat) and describe the future format
> and location for each field in those tables.  I think it would also be
> useful if you could do the same for a couple DSDT entries (eg,
> \_SB.PCI0, \_SB.PCI0.ISA) and also describe how you plan to have the
> guest build the AML from that info.

I do understand your desire to pass this stuff as parameters, but I
really don't see it as feasible.  I'm hoping that if you can write up
some examples with specifics you'll either enlighten me or you'll see
the difficulties I'm trying to describe.

-Kevin
Anthony Liguori June 11, 2013, 1:25 a.m. UTC | #18
Hi Kevin,

On Mon, Jun 10, 2013 at 8:19 PM, Kevin O'Connor <kevin@koconnor.net> wrote:
> On Mon, Jun 10, 2013 at 07:51:55PM -0500, Anthony Liguori wrote:
>> I think that we can pretty much touch a table once pulling all of the
>> info from QOM and then from a SeaBIOS point of view, never have to
>> touch it again.
>
> Thanks.  I do think it would help if you could go through the details
> of a couple of tables and a couple of DSDT entries and walk through
> where each field will come from.  And also describe how seabios will
> build the resulting AML.  From my previous email:
>
>> I'm quite curious how you are planning to solve (a).  I think it would
>> help move this conversation forward if you could take a couple acpi
>> tables in use today (eg, madt, srat) and describe the future format
>> and location for each field in those tables.  I think it would also be
>> useful if you could do the same for a couple DSDT entries (eg,
>> \_SB.PCI0, \_SB.PCI0.ISA) and also describe how you plan to have the
>> guest build the AML from that info.
>
> I do understand your desire to pass this stuff as parameters, but I
> really don't see it as feasible.  I'm hoping that if you can write up
> some examples with specifics you'll either enlighten me or you'll see
> the difficulties I'm trying to describe.

Is your basic argument that generating these tables is hard, QEMU can
and should just hard code them and change them whenever it feels like
without involving SeaBIOS?

If that's what it boils down to, I'll look more closely at it.

Regards,

Anthony Liguori

> -Kevin
Jordan Justen June 11, 2013, 1:32 a.m. UTC | #19
On Mon, Jun 10, 2013 at 6:03 PM, Anthony Liguori <anthony@codemonkey.ws> wrote:
> On Mon, Jun 10, 2013 at 7:28 PM, Jordan Justen <jljusten@gmail.com> wrote:
>> It couldn't hurt if more people that actually care about this spoke up
>> on edk2-devel about the issue, or perhaps within a UEFI working group.
>> Because, I know that they've stopped listening to me about it.
>
> Is this useful?  I can try to make noise.

Possibly. But, I don't want to give you false hope. :)

> I assume since folks like
> you who have much more credibility and familiarity in this space have
> given up, it's a lost cause.

I think customers/partners reporting an issue directly can sometime be
more persuasive than a lone engineer saying this somewhat obscure
licensing thing will be an issue for free software projects.

>> No, but I've always thought that QEMU was happy to have alternative
>> firmware projects.
>
> We are and we're happy to accept patches to enable things even if its
> proprietary.  But that's all assuming we're improving hardware
> emulation.
>
> What we're talking about here is doing something that's very un-hardware like.

Fair enough. I know that it's been done by other VMs, but you
certainly have a point in that real systems don't do this.

-Jordan
Kevin O'Connor June 11, 2013, 1:49 a.m. UTC | #20
On Mon, Jun 10, 2013 at 08:25:15PM -0500, Anthony Liguori wrote:
> On Mon, Jun 10, 2013 at 8:19 PM, Kevin O'Connor <kevin@koconnor.net> wrote:
> > I do understand your desire to pass this stuff as parameters, but I
> > really don't see it as feasible.  I'm hoping that if you can write up
> > some examples with specifics you'll either enlighten me or you'll see
> > the difficulties I'm trying to describe.
> 
> Is your basic argument that generating these tables is hard, QEMU can
> and should just hard code them and change them whenever it feels like
> without involving SeaBIOS?

Yes - that's my basic argument.  Generating the content in a way that
is future proof (or future flexible) is very hard.  The DSDT is
particularly painful.  I'm also afraid of introducing a lot of
complexity when simply hardcoding the content (in the right place) is
the better option.

> If that's what it boils down to, I'll look more closely at it.

Thanks,
-Kevin
Michael S. Tsirkin June 11, 2013, 5:22 a.m. UTC | #21
On Mon, Jun 10, 2013 at 01:58:41PM -0500, Anthony Liguori wrote:
> "Michael S. Tsirkin" <mst@redhat.com> writes:
> 
> > This adds support for device hotplug behind
> > pci bridges. Bridge devices themselves need
> > to be pre-configured on qemu command line.
> >
> > One of the goals of this project was to
> > demonstrate the advantages of generating
> > ACPI tables in QEMU.
> > In my opinion, it does this successfully.
> 
> Since you've gone out of your way to make this difficult to actually
> review...

What's wrong?

> >
> >        * This saves the need to provide a new interface for firmware
> >          to discover bus number to pci brige mapping
> 
> Do you mean fw_cfg?  The BIOS normally does bus numbering.  I see no
> reason for it not to.

I did my best to explain this. ACPI spec explicitly states
config access to devices not on bus 0 is forbidden.
This is to allow OS to renumber the bus at any time.
So you have pci bus number but can't use it.

> >        * No need for yet another interface to bios to detect qemu version
> >          to check if it's safe to activate new code,
> >          and to ship multiple table versions:
> 
> We only care about supporting one version of SeaBIOS.  SeaBIOS should
> only care about supporting one version of QEMU.  We're not asking it to
> support multiple versions.

We do e.g. when we run with -M1.2 and migrate to old QEMU
early in boot process.


> >          we generated code so we know this is new qemu
> >        * Use of glib's GArray makes it much easier to build
> >          up tables in code without need for iasl and code patching
> 
> Adding a dynamic array to SeaBIOS isn't rocket science.

Glib reimplementation in bios will be more code than using existing
glib.

> >
> > I have also tried to address the concerns that
> > Anthony has raised with this approach, please see below.
> >
> > Design:
> >     - each bus gets assigned a number 0-255
> >     - generated ACPI code writes this number
> >       to a new BSEL register, then uses existing
> >       UP/DOWN registers to probe slot status;
> >       to eject, write number to BSEL register,
> >       then slot into existing EJ
> >
> >     This is to address the ACPI spec requirement to
> >     avoid config cycle access to any bus except PCI roots.
> >
> > Portability:
> >     - Non x86 (or any Linux) platforms don't need any of this code.
> >       They can keep happily using SHPC the way
> >       they always did.
> >
> > Things to note:
> >     - this is on top of acpi patchset posted a month ago,
> >       with a small patch adding a core function to walk all
> >       pci buses, on top.
> >       Can also be found in my git tree
> >         git://git.kernel.org/pub/scm/virt/kvm/mst/qemu.git acpi
> >
> >     - Extensive use of glib completely removes
> >       pointer math in new code: we use
> >       g_array_append_vals exclusively.
> >       There's no code patching in new code.
> >       This is in response to comments about security
> >       concerns with adding code to qemu.
> >       In this sense it is more secure than existing
> >       code in hw/acpi/core.c - that can be switched to
> >       glib if desired.
> >
> >     - New code (not imported from seabios)
> >       uses glib to reduce dependency on iasl.
> >       With time all code can be rewritted to
> >       use glib and avoid iasl, if desired.
> >
> >     - As was the case previously,
> >       systems that lack working iasl are detected at configure time,
> >       pre-generated hex files in source tree are used in this case.
> >       This addresses the concern with iasl/big-endian
> >       systems.
> >
> >     - Compile tested only. Migration is known to be
> >       broken - there are a bunch of TODO tags in code.
> >       It was agreed on the last qemu conf meeting that
> >       this code is posted without looking at migration,
> >       with understanding that it is addressed before
> >       being merged. Please do not mistake this
> >       limitation for a fundamental one - I have a
> >       very good idea how to fix it, including
> >       cross-version migration.
> >
> >     - Cross version migration: when running with -M 1.5
> >       and older, all ACPI table generation should be disabled.
> >       We'll present FW_CFG interface compatible with 1.5.
> 
> So TL;DR, you break a bunch of stuff

Confused.
When you asked me to write this up during last conf call,
you explicitly said not to bother with migration?
Almost any change affecting guest needs migration
related work.

> and introduce a mess of code.

What did you expect? New functionality with less code?

>  It
> would be less code and wouldn't break anything to add this logic to
> SeaBIOS.

It will be much more code - we would need to add a bunch of interfaces
to qemu then code using that to seabios, worry about running this
seabios on old qemu, etc etc.

> How is this even a discussion?

Interface change is minimized.

Interface to bios is not changed at all.
Interface to guests - by a single register.

Here we reuse existing fw cfg interface.
That's the point.

> Regards,
> 
> Anthony Liguori
> 
> >
> > Cc: Jordan Justen <jljusten@gmail.com>
> > Cc: Anthony Liguori <anthony@codemonkey.ws>
> > Cc: Laszlo Ersek <lersek@redhat.com>
> > Cc: Gerd Hoffmann <kraxel@redhat.com>
> > Cc: seabios@seabios.org
> > Cc: David Woodhouse <dwmw2@infradead.org>
> > Cc: Kevin O'Connor <kevin@koconnor.net>
> > Cc: Peter Maydell <peter.maydell@linaro.org>
> > Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> > ---
> >  docs/specs/acpi_pci_hotplug.txt  |   8 +
> >  include/hw/i386/pc.h             |   4 +-
> >  hw/i386/acpi-dsdt.dsl            |  36 +++-
> >  hw/i386/ssdt-pcihp.dsl           |  51 -----
> >  hw/acpi/piix4.c                  | 145 ++++++++++----
> >  hw/i386/acpi-build.c             | 411 ++++++++++++++++++++++++++++++++++-----
> >  hw/i386/Makefile.objs            |   2 +-
> >  hw/i386/ssdt-pcihp.hex.generated | 108 ----------
> >  8 files changed, 510 insertions(+), 255 deletions(-)
> >  delete mode 100644 hw/i386/ssdt-pcihp.dsl
> >  delete mode 100644 hw/i386/ssdt-pcihp.hex.generated
> >
> > diff --git a/docs/specs/acpi_pci_hotplug.txt b/docs/specs/acpi_pci_hotplug.txt
> > index a839434..951b99a 100644
> > --- a/docs/specs/acpi_pci_hotplug.txt
> > +++ b/docs/specs/acpi_pci_hotplug.txt
> > @@ -43,3 +43,11 @@ PCI removability status (IO port 0xae0c-0xae0f, 4-byte access):
> >  
> >  Used by ACPI BIOS _RMV method to indicate removability status to OS. One
> >  bit per slot.  Read-only
> > +
> > +PCI bus selector (IO port 0xae10-0xae13, 4-byte access):
> > +-----------------------------------------------
> > +
> > +Used by ACPI BIOS methods to select the current PCI bus.
> > +When written, makes all the other PCI registers (0xae00 - 0xae0f)
> > +to refer to the appropriate bus.
> > +0 selects PCI root bus (default).
> > diff --git a/include/hw/i386/pc.h b/include/hw/i386/pc.h
> > index f8d0871..66ec787 100644
> > --- a/include/hw/i386/pc.h
> > +++ b/include/hw/i386/pc.h
> > @@ -38,7 +38,9 @@ struct PcGuestInfo {
> >      bool s3_disabled;
> >      bool s4_disabled;
> >      uint8_t s4_val;
> > -    DECLARE_BITMAP(slot_hotplug_enable, PCI_SLOT_MAX);
> > +#define GUEST_INFO_MAX_HOTPLUG_BUS 256
> > +    PCIBus *hotplug_buses[GUEST_INFO_MAX_HOTPLUG_BUS];
> > +    DECLARE_BITMAP(hotplug_enable[GUEST_INFO_MAX_HOTPLUG_BUS], PCI_SLOT_MAX);
> >      uint16_t sci_int;
> >      uint8_t acpi_enable_cmd;
> >      uint8_t acpi_disable_cmd;
> > diff --git a/hw/i386/acpi-dsdt.dsl b/hw/i386/acpi-dsdt.dsl
> > index 90efce0..dc05668 100644
> > --- a/hw/i386/acpi-dsdt.dsl
> > +++ b/hw/i386/acpi-dsdt.dsl
> > @@ -133,11 +133,18 @@ DefinitionBlock (
> >              B0EJ, 32,
> >          }
> >  
> > +        OperationRegion(SEL, SystemIO, 0xae10, 0x04)
> > +        Field(SEL, DWordAcc, NoLock, WriteAsZeros) {
> > +            BSEL, 32,
> > +        }
> > +
> >          /* Methods called by bulk generated PCI devices below */
> > +        Name(BNUM, 0x00)
> >  
> >          /* Methods called by hotplug devices */
> > -        Method(PCEJ, 1, NotSerialized) {
> > +        Method(PCEJ, 2, NotSerialized) {
> >              // _EJ0 method - eject callback
> > +            Store(Arg1, BSEL)
> >              Store(ShiftLeft(1, Arg0), B0EJ)
> >              Return (0x0)
> >          }
> > @@ -147,16 +154,25 @@ DefinitionBlock (
> >  
> >          /* PCI hotplug notify method */
> >          Method(PCNF, 0) {
> > -            // Local0 = iterator
> > +            // Local0 = bridge selector
> >              Store(Zero, Local0)
> > -            While (LLess(Local0, 31)) {
> > -                Increment(Local0)
> > -                If (And(PCIU, ShiftLeft(1, Local0))) {
> > -                    PCNT(Local0, 1)
> > -                }
> > -                If (And(PCID, ShiftLeft(1, Local0))) {
> > -                    PCNT(Local0, 3)
> > -                }
> > +            While (LLess(Local0, 255)) {
> > +                    Store(Local0, BSEL)
> > +                    // Local1 = slot iterator
> > +                    Store(Zero, Local1)
> > +
> > +                    Store(PCIU, Local2)
> > +                    Store(PCID, Local3)
> > +                    While (LLess(Local1, 31)) {
> > +                        Increment(Local1)
> > +                        If (And(Local2, ShiftLeft(1, Local1))) {
> > +                            PCNT(Add(Local1, Multiply(Local0, 32)), 1)
> > +                        }
> > +                        If (And(Local3, ShiftLeft(1, Local1))) {
> > +                            PCNT(Add(Local1, Multiply(Local0, 32)), 3)
> > +                        }
> > +                    }
> > +                    Increment(Local0)
> >              }
> >          }
> >      }
> > diff --git a/hw/i386/ssdt-pcihp.dsl b/hw/i386/ssdt-pcihp.dsl
> > deleted file mode 100644
> > index d29a5b9..0000000
> > --- a/hw/i386/ssdt-pcihp.dsl
> > +++ /dev/null
> > @@ -1,51 +0,0 @@
> > -/*
> > - * This program is free software; you can redistribute it and/or modify
> > - * it under the terms of the GNU General Public License as published by
> > - * the Free Software Foundation; either version 2 of the License, or
> > - * (at your option) any later version.
> > -
> > - * This program is distributed in the hope that it will be useful,
> > - * but WITHOUT ANY WARRANTY; without even the implied warranty of
> > - * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> > - * GNU General Public License for more details.
> > -
> > - * You should have received a copy of the GNU General Public License along
> > - * with this program; if not, see <http://www.gnu.org/licenses/>.
> > - */
> > -
> > -ACPI_EXTRACT_ALL_CODE ssdp_pcihp_aml
> > -
> > -DefinitionBlock ("ssdt-pcihp.aml", "SSDT", 0x01, "BXPC", "BXSSDTPCIHP", 0x1)
> > -{
> > -
> > -/****************************************************************
> > - * PCI hotplug
> > - ****************************************************************/
> > -
> > -    /* Objects supplied by DSDT */
> > -    External(\_SB.PCI0, DeviceObj)
> > -    External(\_SB.PCI0.PCEJ, MethodObj)
> > -
> > -    Scope(\_SB.PCI0) {
> > -
> > -        /* Bulk generated PCI hotplug devices */
> > -        ACPI_EXTRACT_DEVICE_START ssdt_pcihp_start
> > -        ACPI_EXTRACT_DEVICE_END ssdt_pcihp_end
> > -        ACPI_EXTRACT_DEVICE_STRING ssdt_pcihp_name
> > -
> > -        // Method _EJ0 can be patched by BIOS to EJ0_
> > -        // at runtime, if the slot is detected to not support hotplug.
> > -        // Extract the offset of the address dword and the
> > -        // _EJ0 name to allow this patching.
> > -        Device(SAA) {
> > -            ACPI_EXTRACT_NAME_BYTE_CONST ssdt_pcihp_id
> > -            Name(_SUN, 0xAA)
> > -            ACPI_EXTRACT_NAME_DWORD_CONST ssdt_pcihp_adr
> > -            Name(_ADR, 0xAA0000)
> > -            ACPI_EXTRACT_METHOD_STRING ssdt_pcihp_ej0
> > -            Method(_EJ0, 1) {
> > -                Return (PCEJ(_SUN))
> > -            }
> > -        }
> > -    }
> > -}
> > diff --git a/hw/acpi/piix4.c b/hw/acpi/piix4.c
> > index 19a26f5..10106fd 100644
> > --- a/hw/acpi/piix4.c
> > +++ b/hw/acpi/piix4.c
> > @@ -29,6 +29,7 @@
> >  #include "exec/ioport.h"
> >  #include "hw/nvram/fw_cfg.h"
> >  #include "exec/address-spaces.h"
> > +#include "hw/pci/pci_bus.h"
> >  
> >  //#define DEBUG
> >  
> > @@ -42,11 +43,12 @@
> >  #define GPE_LEN 4
> >  
> >  #define PCI_HOTPLUG_ADDR 0xae00
> > -#define PCI_HOTPLUG_SIZE 0x000f
> > +#define PCI_HOTPLUG_SIZE 0x0014
> >  #define PCI_UP_BASE 0xae00
> >  #define PCI_DOWN_BASE 0xae04
> >  #define PCI_EJ_BASE 0xae08
> >  #define PCI_RMV_BASE 0xae0c
> > +#define PCI_SEL_BASE 0xae10
> >  
> >  #define PIIX4_PROC_BASE 0xaf00
> >  #define PIIX4_PROC_LEN 32
> > @@ -57,6 +59,8 @@
> >  struct pci_status {
> >      uint32_t up; /* deprecated, maintained for migration compatibility */
> >      uint32_t down;
> > +    uint32_t hotplug_enable;
> > +    uint32_t device_present;
> >  };
> >  
> >  typedef struct CPUStatus {
> > @@ -84,9 +88,8 @@ typedef struct PIIX4PMState {
> >      Notifier powerdown_notifier;
> >  
> >      /* for pci hotplug */
> > -    struct pci_status pci0_status;
> > -    uint32_t pci0_hotplug_enable;
> > -    uint32_t pci0_slot_device_present;
> > +    struct pci_status pci_status[GUEST_INFO_MAX_HOTPLUG_BUS];
> > +    uint32_t hotplug_select;
> >  
> >      uint8_t disable_s3;
> >      uint8_t disable_s4;
> > @@ -98,6 +101,17 @@ typedef struct PIIX4PMState {
> >      PcGuestInfo *guest_info;
> >  } PIIX4PMState;
> >  
> > +static int piix4_find_hotplug_bus(PcGuestInfo *guest_info, PCIBus *bus)
> > +{
> > +    int i;
> > +    for (i = 0; i < GUEST_INFO_MAX_HOTPLUG_BUS; ++i) {
> > +        if (guest_info->hotplug_buses[i] == bus) {
> > +            return i;
> > +        }
> > +    }
> > +    return -1;
> > +}
> > +
> >  static void piix4_acpi_system_hot_add_init(MemoryRegion *parent,
> >                                             PCIBus *bus, PIIX4PMState *s);
> >  
> > @@ -183,6 +197,8 @@ static void pm_write_config(PCIDevice *d,
> >  
> >  static void vmstate_pci_status_pre_save(void *opaque)
> >  {
> > +#if 0
> > +    /* TODO */
> >      struct pci_status *pci0_status = opaque;
> >      PIIX4PMState *s = container_of(pci0_status, PIIX4PMState, pci0_status);
> >  
> > @@ -190,6 +206,7 @@ static void vmstate_pci_status_pre_save(void *opaque)
> >       * to a version that still does... of course these might get lost
> >       * by an old buggy implementation, but we try. */
> >      pci0_status->up = s->pci0_slot_device_present & s->pci0_hotplug_enable;
> > +#endif
> >  }
> >  
> >  static int vmstate_acpi_post_load(void *opaque, int version_id)
> > @@ -267,7 +284,10 @@ static int acpi_load_old(QEMUFile *f, void *opaque, int version_id)
> >          qemu_get_be16s(f, &temp);
> >      }
> >  
> > +#if 0
> > +    TODO
> >      ret = vmstate_load_state(f, &vmstate_pci_status, &s->pci0_status, 1);
> > +#endif
> >      return ret;
> >  }
> >  
> > @@ -293,21 +313,28 @@ static const VMStateDescription vmstate_acpi = {
> >          VMSTATE_TIMER(ar.tmr.timer, PIIX4PMState),
> >          VMSTATE_INT64(ar.tmr.overflow_time, PIIX4PMState),
> >          VMSTATE_STRUCT(ar.gpe, PIIX4PMState, 2, vmstate_gpe, ACPIGPE),
> > +#if 0
> >          VMSTATE_STRUCT(pci0_status, PIIX4PMState, 2, vmstate_pci_status,
> >                         struct pci_status),
> > +#endif
> >          VMSTATE_END_OF_LIST()
> >      }
> >  };
> >  
> > -static void acpi_piix_eject_slot(PIIX4PMState *s, unsigned slots)
> > +static void acpi_piix_eject_slot(PIIX4PMState *s, unsigned bsel, unsigned slots)
> >  {
> >      BusChild *kid, *next;
> > -    BusState *bus = qdev_get_parent_bus(&s->dev.qdev);
> > +    BusState *bus;
> >      int slot = ffs(slots) - 1;
> >      bool slot_free = true;
> >  
> > +    if (!s->guest_info->hotplug_buses[bsel]) {
> > +        return;
> > +    }
> > +    bus = &s->guest_info->hotplug_buses[bsel]->qbus;
> > +
> >      /* Mark request as complete */
> > -    s->pci0_status.down &= ~(1U << slot);
> > +    s->pci_status[bsel].down &= ~(1U << slot);
> >  
> >      QTAILQ_FOREACH_SAFE(kid, &bus->children, sibling, next) {
> >          DeviceState *qdev = kid->child;
> > @@ -322,23 +349,26 @@ static void acpi_piix_eject_slot(PIIX4PMState *s, unsigned slots)
> >          }
> >      }
> >      if (slot_free) {
> > -        s->pci0_slot_device_present &= ~(1U << slot);
> > +        s->pci_status[bsel].device_present &= ~(1U << slot);
> >      }
> >  }
> >  
> > -static void piix4_update_hotplug(PIIX4PMState *s)
> > +static void piix4_update_hotplug(PIIX4PMState *s, unsigned bsel)
> >  {
> > -    PCIDevice *dev = &s->dev;
> > -    BusState *bus = qdev_get_parent_bus(&dev->qdev);
> >      BusChild *kid, *next;
> > +    BusState *bus;
> >  
> > +    if (!s->guest_info->hotplug_buses[bsel]) {
> > +        return;
> > +    }
> > +    bus = &s->guest_info->hotplug_buses[bsel]->qbus;
> >      /* Execute any pending removes during reset */
> > -    while (s->pci0_status.down) {
> > -        acpi_piix_eject_slot(s, s->pci0_status.down);
> > +    while (s->pci_status[bsel].down) {
> > +        acpi_piix_eject_slot(s, bsel, s->pci_status[bsel].down);
> >      }
> >  
> > -    s->pci0_hotplug_enable = ~0;
> > -    s->pci0_slot_device_present = 0;
> > +    s->pci_status[bsel].hotplug_enable = ~0;
> > +    s->pci_status[bsel].device_present = 0;
> >  
> >      QTAILQ_FOREACH_SAFE(kid, &bus->children, sibling, next) {
> >          DeviceState *qdev = kid->child;
> > @@ -347,10 +377,10 @@ static void piix4_update_hotplug(PIIX4PMState *s)
> >          int slot = PCI_SLOT(pdev->devfn);
> >  
> >          if (pc->no_hotplug) {
> > -            s->pci0_hotplug_enable &= ~(1U << slot);
> > +            s->pci_status[bsel].hotplug_enable &= ~(1U << slot);
> >          }
> >  
> > -        s->pci0_slot_device_present |= (1U << slot);
> > +        s->pci_status[bsel].device_present |= (1U << slot);
> >      }
> >  }
> >  
> > @@ -358,6 +388,7 @@ static void piix4_reset(void *opaque)
> >  {
> >      PIIX4PMState *s = opaque;
> >      uint8_t *pci_conf = s->dev.config;
> > +    int i;
> >  
> >      pci_conf[0x58] = 0;
> >      pci_conf[0x59] = 0;
> > @@ -371,7 +402,9 @@ static void piix4_reset(void *opaque)
> >          /* Mark SMM as already inited (until KVM supports SMM). */
> >          pci_conf[0x5B] = 0x02;
> >      }
> > -    piix4_update_hotplug(s);
> > +    for (i = 0; i < GUEST_INFO_MAX_HOTPLUG_BUS; ++i) {
> > +        piix4_update_hotplug(s, i);
> > +    }
> >  }
> >  
> >  static void piix4_pm_powerdown_req(Notifier *n, void *opaque)
> > @@ -382,14 +415,20 @@ static void piix4_pm_powerdown_req(Notifier *n, void *opaque)
> >      acpi_pm1_evt_power_down(&s->ar);
> >  }
> >  
> > -static void piix4_update_guest_info(PIIX4PMState *s)
> > +static int piix4_device_hotplug(DeviceState *qdev, PCIDevice *dev,
> > +                                PCIHotplugState state);
> > +
> > +static void piix4_update_bus_guest_info(PCIBus *pci_bus, void *opaque)
> >  {
> > -    PCIDevice *dev = &s->dev;
> > -    BusState *bus = qdev_get_parent_bus(&dev->qdev);
> > +    PIIX4PMState *s = opaque;
> > +    PcGuestInfo *guest_info = s->guest_info;
> >      BusChild *kid, *next;
> > +    BusState *bus = &pci_bus->qbus;
> >  
> > -    memset(s->guest_info->slot_hotplug_enable, 0xff,
> > -           DIV_ROUND_UP(PCI_SLOT_MAX, BITS_PER_BYTE));
> > +    int i = piix4_find_hotplug_bus(guest_info, NULL);
> > +    assert(i >= 0);
> > +
> > +    guest_info->hotplug_buses[i] = pci_bus;
> >  
> >      QTAILQ_FOREACH_SAFE(kid, &bus->children, sibling, next) {
> >          DeviceState *qdev = kid->child;
> > @@ -398,9 +437,22 @@ static void piix4_update_guest_info(PIIX4PMState *s)
> >          int slot = PCI_SLOT(pdev->devfn);
> >  
> >          if (pc->no_hotplug) {
> > -            clear_bit(slot, s->guest_info->slot_hotplug_enable);
> > +            clear_bit(slot, guest_info->hotplug_enable[i]);
> >          }
> >      }
> > +
> > +    pci_bus_hotplug(pci_bus, piix4_device_hotplug, &s->dev.qdev);
> > +}
> > +
> > +static void piix4_update_guest_info(PIIX4PMState *s)
> > +{
> > +    PCIDevice *dev = &s->dev;
> > +
> > +    memset(s->guest_info->hotplug_enable, 0xff,
> > +           sizeof s->guest_info->hotplug_enable);
> > +
> > +    pci_for_each_bus(dev->bus, piix4_update_bus_guest_info,
> > +                     s);
> >  }
> >  
> >  static void piix4_pm_machine_ready(Notifier *n, void *opaque)
> > @@ -589,16 +641,22 @@ static uint64_t pci_read(void *opaque, hwaddr addr, unsigned int size)
> >  {
> >      PIIX4PMState *s = opaque;
> >      uint32_t val = 0;
> > +    int bsel = s->hotplug_select;
> > +
> > +    if (bsel > GUEST_INFO_MAX_HOTPLUG_BUS) {
> > +        return 0;
> > +    }
> >  
> >      switch (addr) {
> >      case PCI_UP_BASE - PCI_HOTPLUG_ADDR:
> >          /* Manufacture an "up" value to cause a device check on any hotplug
> >           * slot with a device.  Extra device checks are harmless. */
> > -        val = s->pci0_slot_device_present & s->pci0_hotplug_enable;
> > +        val = s->pci_status[bsel].device_present &
> > +            s->pci_status[bsel].hotplug_enable;
> >          PIIX4_DPRINTF("pci_up_read %x\n", val);
> >          break;
> >      case PCI_DOWN_BASE - PCI_HOTPLUG_ADDR:
> > -        val = s->pci0_status.down;
> > +        val = s->pci_status[bsel].down;
> >          PIIX4_DPRINTF("pci_down_read %x\n", val);
> >          break;
> >      case PCI_EJ_BASE - PCI_HOTPLUG_ADDR:
> > @@ -606,8 +664,10 @@ static uint64_t pci_read(void *opaque, hwaddr addr, unsigned int size)
> >          PIIX4_DPRINTF("pci_features_read %x\n", val);
> >          break;
> >      case PCI_RMV_BASE - PCI_HOTPLUG_ADDR:
> > -        val = s->pci0_hotplug_enable;
> > +        val = s->pci_status[bsel].hotplug_enable;
> >          break;
> > +    case PCI_SEL_BASE - PCI_HOTPLUG_ADDR:
> > +        val = s->hotplug_select;
> >      default:
> >          break;
> >      }
> > @@ -618,12 +678,18 @@ static uint64_t pci_read(void *opaque, hwaddr addr, unsigned int size)
> >  static void pci_write(void *opaque, hwaddr addr, uint64_t data,
> >                        unsigned int size)
> >  {
> > +    PIIX4PMState *s = opaque;
> >      switch (addr) {
> >      case PCI_EJ_BASE - PCI_HOTPLUG_ADDR:
> > -        acpi_piix_eject_slot(opaque, (uint32_t)data);
> > +        if (s->hotplug_select >= GUEST_INFO_MAX_HOTPLUG_BUS) {
> > +            break;
> > +        }
> > +        acpi_piix_eject_slot(s, s->hotplug_select, data);
> >          PIIX4_DPRINTF("pciej write %" HWADDR_PRIx " <== % " PRIu64 "\n",
> >                        addr, data);
> >          break;
> > +    case PCI_SEL_BASE - PCI_HOTPLUG_ADDR:
> > +        s->hotplug_select = data;
> >      default:
> >          break;
> >      }
> > @@ -706,9 +772,6 @@ static void piix4_init_cpu_status(CPUState *cpu, void *data)
> >      g->sts[id / 8] |= (1 << (id % 8));
> >  }
> >  
> > -static int piix4_device_hotplug(DeviceState *qdev, PCIDevice *dev,
> > -                                PCIHotplugState state);
> > -
> >  static void piix4_acpi_system_hot_add_init(MemoryRegion *parent,
> >                                             PCIBus *bus, PIIX4PMState *s)
> >  {
> > @@ -720,8 +783,6 @@ static void piix4_acpi_system_hot_add_init(MemoryRegion *parent,
> >                            PCI_HOTPLUG_SIZE);
> >      memory_region_add_subregion(parent, PCI_HOTPLUG_ADDR,
> >                                  &s->io_pci);
> > -    pci_bus_hotplug(bus, piix4_device_hotplug, &s->dev.qdev);
> > -
> >      qemu_for_each_cpu(piix4_init_cpu_status, &s->gpe_cpu);
> >      memory_region_init_io(&s->io_cpu, &cpu_hotplug_ops, s, "apci-cpu-hotplug",
> >                            PIIX4_PROC_LEN);
> > @@ -730,16 +791,16 @@ static void piix4_acpi_system_hot_add_init(MemoryRegion *parent,
> >      qemu_register_cpu_added_notifier(&s->cpu_added_notifier);
> >  }
> >  
> > -static void enable_device(PIIX4PMState *s, int slot)
> > +static void enable_device(PIIX4PMState *s, unsigned bsel, int slot)
> >  {
> >      s->ar.gpe.sts[0] |= PIIX4_PCI_HOTPLUG_STATUS;
> > -    s->pci0_slot_device_present |= (1U << slot);
> > +    s->pci_status[bsel].device_present |= (1U << slot);
> >  }
> >  
> > -static void disable_device(PIIX4PMState *s, int slot)
> > +static void disable_device(PIIX4PMState *s, unsigned bsel, int slot)
> >  {
> >      s->ar.gpe.sts[0] |= PIIX4_PCI_HOTPLUG_STATUS;
> > -    s->pci0_status.down |= (1U << slot);
> > +    s->pci_status[bsel].down |= (1U << slot);
> >  }
> >  
> >  static int piix4_device_hotplug(DeviceState *qdev, PCIDevice *dev,
> > @@ -748,19 +809,23 @@ static int piix4_device_hotplug(DeviceState *qdev, PCIDevice *dev,
> >      int slot = PCI_SLOT(dev->devfn);
> >      PIIX4PMState *s = DO_UPCAST(PIIX4PMState, dev,
> >                                  PCI_DEVICE(qdev));
> > +    int bsel = piix4_find_hotplug_bus(s->guest_info, dev->bus);
> > +    if (bsel < 0) {
> > +        return -1;
> > +    }
> >  
> >      /* Don't send event when device is enabled during qemu machine creation:
> >       * it is present on boot, no hotplug event is necessary. We do send an
> >       * event when the device is disabled later. */
> >      if (state == PCI_COLDPLUG_ENABLED) {
> > -        s->pci0_slot_device_present |= (1U << slot);
> > +        s->pci_status[bsel].device_present |= (1U << slot);
> >          return 0;
> >      }
> >  
> >      if (state == PCI_HOTPLUG_ENABLED) {
> > -        enable_device(s, slot);
> > +        enable_device(s, bsel, slot);
> >      } else {
> > -        disable_device(s, slot);
> > +        disable_device(s, bsel, slot);
> >      }
> >  
> >      pm_update_sci(s);
> > diff --git a/hw/i386/acpi-build.c b/hw/i386/acpi-build.c
> > index 72606a8..43e4988 100644
> > --- a/hw/i386/acpi-build.c
> > +++ b/hw/i386/acpi-build.c
> > @@ -34,6 +34,7 @@
> >  #include "hw/acpi/acpi.h"
> >  #include "hw/nvram/fw_cfg.h"
> >  #include "hw/i386/bios-linker-loader.h"
> > +#include "hw/pci/pci_bus.h"
> >  
> >  #define ACPI_BUILD_APPNAME  "Bochs"
> >  #define ACPI_BUILD_APPNAME6 "BOCHS "
> > @@ -258,23 +259,358 @@ acpi_encode_len(uint8_t *ssdt_ptr, int length, int bytes)
> >  #define ACPI_PROC_SIZEOF (*ssdt_proc_end - *ssdt_proc_start)
> >  #define ACPI_PROC_AML (ssdp_proc_aml + *ssdt_proc_start)
> >  
> > -/* 0x5B 0x82 DeviceOp PkgLength NameString */
> > -#define ACPI_PCIHP_OFFSET_HEX (*ssdt_pcihp_name - *ssdt_pcihp_start + 1)
> > -#define ACPI_PCIHP_OFFSET_ID (*ssdt_pcihp_id - *ssdt_pcihp_start)
> > -#define ACPI_PCIHP_OFFSET_ADR (*ssdt_pcihp_adr - *ssdt_pcihp_start)
> > -#define ACPI_PCIHP_OFFSET_EJ0 (*ssdt_pcihp_ej0 - *ssdt_pcihp_start)
> > -#define ACPI_PCIHP_SIZEOF (*ssdt_pcihp_end - *ssdt_pcihp_start)
> > -#define ACPI_PCIHP_AML (ssdp_pcihp_aml + *ssdt_pcihp_start)
> > -
> >  #define ACPI_SSDT_SIGNATURE 0x54445353 /* SSDT */
> >  #define ACPI_SSDT_HEADER_LENGTH 36
> >  
> >  #include "hw/i386/ssdt-misc.hex"
> > -#include "hw/i386/ssdt-pcihp.hex"
> > +
> > +static inline GArray *build_alloc_array(void)
> > +{
> > +        return g_array_new(false, true /* clear */, 1);
> > +}
> > +
> > +static inline void build_free_array(GArray *array)
> > +{
> > +        g_array_free(array, true);
> > +}
> > +
> > +static inline void build_prepend_byte(GArray *array, uint8_t val)
> > +{
> > +    g_array_prepend_val(array, val);
> > +}
> > +
> > +static inline void build_append_byte(GArray *array, uint8_t val)
> > +{
> > +    g_array_append_val(array, val);
> > +}
> > +
> > +static inline void build_prepend_array(GArray *array, GArray *val)
> > +{
> > +    g_array_prepend_vals(array, val->data, val->len);
> > +}
> > +
> > +static inline void build_append_array(GArray *array, GArray *val)
> > +{
> > +    g_array_append_vals(array, val->data, val->len);
> > +}
> > +
> > +static void build_append_nameseg(GArray *array, const char *format, ...)
> > +{
> > +    GString *s = g_string_new("");
> > +    va_list args;
> > +
> > +    va_start(args, format);
> > +    g_string_vprintf(s, format, args);
> > +    va_end(args);
> > +
> > +    assert(s->len == 4);
> > +    g_array_append_vals(array, s->str, s->len);
> > +    g_string_free(s, true);
> > +}
> > +
> > +static void build_prepend_nameseg(GArray *array, const char *format, ...)
> > +{
> > +    GString *s = g_string_new("");
> > +    va_list args;
> > +
> > +    va_start(args, format);
> > +    g_string_vprintf(s, format, args);
> > +    va_end(args);
> > +
> > +    assert(s->len == 4);
> > +    g_array_prepend_vals(array, s->str, s->len);
> > +    g_string_free(s, true);
> > +}
> > +
> > +static void build_prepend_devfn(GArray *name, uint8_t devfn)
> > +{
> > +    build_prepend_nameseg(name, "S%0.02x_", devfn);
> > +}
> > +
> > +static void build_append_devfn(GArray *name, uint8_t devfn)
> > +{
> > +    GArray *array = build_alloc_array();
> > +    build_prepend_devfn(array, devfn);
> > +    build_append_array(name, array);
> > +    build_free_array(array);
> > +}
> > +
> > +static void build_prepend_name_string(GArray *name, PCIBus *bus)
> > +{
> > +    int segcount;
> > +
> > +    for (;!pci_bus_is_root(bus); bus = bus->parent_dev->bus) {
> > +        build_prepend_devfn(name, bus->parent_dev->devfn);
> > +    }
> > +    g_array_prepend_vals(name, "PCI0", 4);
> > +    g_array_prepend_vals(name, "_SB_", 4);
> > +    /* Name segment length is a multiple of 4 */
> > +    assert(name->len % 4 == 0);
> > +    segcount = name->len / 4;
> > +    build_prepend_byte(name, segcount);
> > +    build_prepend_byte(name, 0x2f); /* MultiNamePrefix */
> > +    build_prepend_byte(name, '\\'); /* RootChar */
> > +}
> > +
> > +static void build_append_name_string(GArray *name, PCIBus *bus)
> > +{
> > +    GArray *append = build_alloc_array();
> > +    build_prepend_name_string(append, bus);
> > +    build_append_array(name, append);
> > +    build_free_array(append);
> > +}
> > +
> > +/* 5.4 Definition Block Encoding */
> > +enum {
> > +    PACKAGE_LENGTH_1BYTE_SHIFT = 6, /* Up to 63 - use extra 2 bits. */
> > +    PACKAGE_LENGTH_2BYTE_SHIFT = 4,
> > +    PACKAGE_LENGTH_3BYTE_SHIFT = 12,
> > +    PACKAGE_LENGTH_4BYTE_SHIFT = 20,
> > +};
> > +
> > +static void build_prepend_package_length(GArray *package)
> > +{
> > +    unsigned bytes;
> > +    uint8_t byte;
> > +    unsigned package_len = package->len;
> > +    unsigned length = package_len;
> > +
> > +    if (length + 1 < (1 << PACKAGE_LENGTH_1BYTE_SHIFT)) {
> > +        byte = length;
> > +        build_prepend_byte(package, byte);
> > +        return;
> > +    }
> > +
> > +    if (length + 4 > (1 << PACKAGE_LENGTH_4BYTE_SHIFT)) {
> > +        byte = length >> PACKAGE_LENGTH_4BYTE_SHIFT;
> > +        build_prepend_byte(package, byte);
> > +        length &= (1 << PACKAGE_LENGTH_4BYTE_SHIFT) - 1;
> > +    }
> > +    if (length + 3 > (1 << PACKAGE_LENGTH_3BYTE_SHIFT)) {
> > +        byte = length >> PACKAGE_LENGTH_3BYTE_SHIFT;
> > +        build_prepend_byte(package, byte);
> > +        length &= (1 << PACKAGE_LENGTH_3BYTE_SHIFT) - 1;
> > +    }
> > +    if (length + 2 > (1 << PACKAGE_LENGTH_2BYTE_SHIFT)) {
> > +        byte = length >> PACKAGE_LENGTH_2BYTE_SHIFT;
> > +        build_prepend_byte(package, byte);
> > +        length &= (1 << PACKAGE_LENGTH_2BYTE_SHIFT) - 1;
> > +    }
> > +
> > +    bytes = package->len - package_len;
> > +    byte = (bytes << PACKAGE_LENGTH_1BYTE_SHIFT) | length;
> > +    build_prepend_byte(package, byte);
> > +}
> > +
> > +static void build_package(GArray *package, uint8_t op)
> > +{
> > +    build_prepend_package_length(package);
> > +    build_prepend_byte(package, op);
> > +}
> > +
> > +static void build_extop_package(GArray *package, uint8_t op)
> > +{
> > +    build_package(package, op);
> > +    build_prepend_byte(package, 0x5B); /* ExtOpPrefix */
> > +}
> > +
> > +static void build_append_notify(GArray *method, GArray *target_name,
> > +                                uint16_t value)
> > +{
> > +    GArray *notify = build_alloc_array();
> > +    uint8_t op = 0xA0; /* IfOp */
> > +
> > +    build_append_byte(notify, 0x93); /* LEqualOp */
> > +    build_append_byte(notify, 0x68); /* Arg0Op */
> > +    build_append_byte(notify, 0x0A); /* BytePrefix */
> > +    build_append_byte(notify, value);
> > +    build_append_byte(notify, value >> 8);
> > +    build_append_byte(notify, 0x86); /* NotifyOp */
> > +    build_append_array(notify, target_name);
> > +    build_append_byte(notify, 0x69); /* Arg1Op */
> > +
> > +    /* Pack it up */
> > +    build_package(notify, op);
> > +
> > +    build_append_array(method, notify);
> > +
> > +    build_free_array(notify);
> > +}
> > +
> > +static void
> > +build_append_notify_slots(GArray *method, PCIBus *bus,
> > +                          unsigned bus_select,
> > +                          DECLARE_BITMAP(hotplug_enable, PCI_SLOT_MAX))
> > +{
> > +    int i;
> > +
> > +    if (!bus) {
> > +        return;
> > +    }
> > +
> > +    for (i = 0; i < 32; i++) {
> > +        GArray *target;
> > +        if (!test_bit(i, hotplug_enable)) {
> > +            continue;
> > +        }
> > +        target = build_alloc_array();
> > +        build_prepend_devfn(target, i);
> > +        build_prepend_name_string(target, bus);
> > +        /* *32 matches Multiply in PCNF */
> > +        build_append_notify(method, target, bus_select * 32 + i);
> > +        build_free_array(target);
> > +    }
> > +}
> > +
> > +static void build_append_notify_buses(GArray *table, PcGuestInfo *guest_info)
> > +{
> > +    int i;
> > +    GArray *method = build_alloc_array();
> > +
> > +    uint8_t op = 0x14; /* MethodOp */
> > +
> > +    build_append_nameseg(method, "PCNT");
> > +    build_append_byte(method, 0x02); /* MethodFlags: ArgCount */
> > +    for (i = 0; i < ARRAY_SIZE(guest_info->hotplug_buses); ++i) {
> > +        build_append_notify_slots(method, guest_info->hotplug_buses[i], i,
> > +                                  guest_info->hotplug_enable[i]);
> > +    }
> > +    build_package(method, op);
> > +
> > +    build_append_array(table, method);
> > +    build_free_array(method);
> > +}
> > +
> > +static void build_append_device(GArray *table, unsigned devfn)
> > +{
> > +    uint8_t op, ej0_op;
> > +
> > +    GArray *device = build_alloc_array();
> > +    GArray *ej0 = build_alloc_array();
> > +
> > +    op = 0x82; /* DeviceOp */
> > +
> > +    build_append_devfn(device, devfn);
> > +
> > +    build_append_byte(device, 0x08); /* NameOp */
> > +    build_append_nameseg(device, "_SUN");
> > +    build_append_byte(device, 0x0A); /* BytePrefix */
> > +    build_append_byte(device, PCI_SLOT(devfn));
> > +    build_append_byte(device, 0x08); /* NameOp */
> > +    build_append_nameseg(device, "_ADR");
> > +    build_append_byte(device, 0x0C); /* DWordPrefix */
> > +    build_append_byte(device, PCI_FUNC(devfn));
> > +    build_append_byte(device, 0x00);
> > +    build_append_byte(device, PCI_SLOT(devfn));
> > +    build_append_byte(device, 0x00);
> > +
> > +    ej0_op = 0x14;  /* MethodOp */
> > +    build_append_nameseg(ej0, "_EJ0");
> > +    build_append_byte(ej0, 0x01); /* MethodFlags: ArgCount */
> > +    build_append_byte(ej0, 0xA4); /* ReturnOp */
> > +    build_append_nameseg(ej0, "PCEJ");
> > +    build_append_nameseg(ej0, "_SUN");
> > +    build_append_nameseg(ej0, "BNUM");
> > +    build_package(ej0, ej0_op);
> > +    build_append_array(device, ej0);
> > +
> > +    build_extop_package(device, op);
> > +
> > +    build_append_array(table, device);
> > +    build_free_array(device);
> > +    build_free_array(ej0);
> > +}
> > +
> > +static bool build_find_device_bus(PCIBus *buses[GUEST_INFO_MAX_HOTPLUG_BUS],
> > +                                  PCIBus *bus, unsigned devfn)
> > +{
> > +    int i;
> > +    for (i = 0; i < ARRAY_SIZE(buses); ++i) {
> > +        if (buses[i] && buses[i]->parent_dev &&
> > +            buses[i]->parent_dev->bus == bus &&
> > +            buses[i]->parent_dev->devfn == devfn) {
> > +            return true;
> > +        }
> > +    }
> > +    return false;
> > +}
> > +
> > +static void
> > +build_append_bus(GArray *table,  int bnum, PcGuestInfo *guest_info)
> > +{
> > +    uint8_t op;
> > +    int slot;
> > +    unsigned devfn;
> > +    GArray *device;
> > +    PCIBus *bus = guest_info->hotplug_buses[bnum];
> > +
> > +    if (!bus) {
> > +        return;
> > +    }
> > +
> > +    device = build_alloc_array();
> > +
> > +    /* Root device described in DSDT */
> > +    if (bnum) {
> > +        devfn = bus->parent_dev->devfn;
> > +
> > +        op = 0x82; /* DeviceOp */
> > +
> > +        build_append_name_string(device, bus);
> > +
> > +        build_append_byte(device, 0x08); /* NameOp */
> > +        build_append_nameseg(device, "_SUN");
> > +        build_append_byte(device, 0x0A); /* BytePrefix */
> > +        build_append_byte(device, PCI_SLOT(devfn));
> > +        build_append_byte(device, 0x08); /* NameOp */
> > +        build_append_nameseg(device, "BNUM");
> > +        build_append_byte(device, 0x0A); /* BytePrefix */
> > +        build_append_byte(device, bnum);
> > +        build_append_byte(device, 0x08); /* NameOp */
> > +        build_append_nameseg(device, "_ADR");
> > +        build_append_byte(device, 0x0C); /* DWordPrefix */
> > +        build_append_byte(device, PCI_FUNC(devfn));
> > +        build_append_byte(device, 0x00);
> > +        build_append_byte(device, PCI_SLOT(devfn));
> > +        build_append_byte(device, 0x00);
> > +    } else {
> > +        op = 0x10; /* ScopeOp */
> > +        build_append_nameseg(device, "PCI0");
> > +    }
> > +
> > +    for (slot = 0; slot < PCI_SLOT_MAX; ++slot) {
> > +        if (!test_bit(slot, guest_info->hotplug_enable[bnum])) {
> > +            continue;
> > +        }
> > +        /* Bus device? We'll add it later */
> > +        if (build_find_device_bus(guest_info->hotplug_buses,
> > +                                  bus, PCI_DEVFN(slot, 0))) {
> > +            continue;
> > +        }
> > +        build_append_device(device, PCI_DEVFN(slot, 0));
> > +    }
> > +
> > +    if (bnum) {
> > +        build_extop_package(device, op);
> > +    } else {
> > +        build_package(device, op);
> > +    }
> > +
> > +    build_append_array(table, device);
> > +    build_free_array(device);
> > +}
> > +
> > +static void build_append_device_buses(GArray *table, PcGuestInfo *guest_info)
> > +{
> > +    int i;
> > +    for (i = 0; i < ARRAY_SIZE(guest_info->hotplug_buses); ++i) {
> > +        build_append_bus(table, i, guest_info);
> > +    }
> > +}
> >  
> >  static uint8_t*
> >  build_notify(uint8_t *ssdt_ptr, const char *name, int skip, int count,
> > -             const char *target, int ofs)
> > +             const char *target, int ofs, int step)
> >  {
> >      int i;
> >  
> > @@ -295,31 +631,14 @@ build_notify(uint8_t *ssdt_ptr, const char *name, int skip, int count,
> >          *(ssdt_ptr++) = i;
> >          *(ssdt_ptr++) = 0x86; /* NotifyOp */
> >          memcpy(ssdt_ptr, target, 4);
> > -        ssdt_ptr[ofs] = acpi_get_hex(i >> 4);
> > -        ssdt_ptr[ofs + 1] = acpi_get_hex(i);
> > +        ssdt_ptr[ofs] = acpi_get_hex((i * step) >> 4);
> > +        ssdt_ptr[ofs + 1] = acpi_get_hex(i * step);
> >          ssdt_ptr += 4;
> >          *(ssdt_ptr++) = 0x69; /* Arg1Op */
> >      }
> >      return ssdt_ptr;
> >  }
> >  
> > -static void patch_pcihp(int slot, uint8_t *ssdt_ptr, uint32_t eject)
> > -{
> > -    ssdt_ptr[ACPI_PCIHP_OFFSET_HEX] = acpi_get_hex(slot >> 4);
> > -    ssdt_ptr[ACPI_PCIHP_OFFSET_HEX+1] = acpi_get_hex(slot);
> > -    ssdt_ptr[ACPI_PCIHP_OFFSET_ID] = slot;
> > -    ssdt_ptr[ACPI_PCIHP_OFFSET_ADR + 2] = slot;
> > -
> > -    /* Runtime patching of ACPI_EJ0: to disable hotplug for a slot,
> > -     * replace the method name: _EJ0 by ACPI_EJ0_. */
> > -    /* Sanity check */
> > -    assert (!memcmp(ssdt_ptr + ACPI_PCIHP_OFFSET_EJ0, "_EJ0", 4));
> > -
> > -    if (!eject) {
> > -        memcpy(ssdt_ptr + ACPI_PCIHP_OFFSET_EJ0, "EJ0_", 4);
> > -    }
> > -}
> > -
> >  static void
> >  build_ssdt(GArray *table_data, GArray *linker,
> >             FWCfgState *fw_cfg, PcGuestInfo *guest_info)
> > @@ -330,11 +649,20 @@ build_ssdt(GArray *table_data, GArray *linker,
> >                    + (acpi_cpus * ACPI_PROC_SIZEOF)               /* procs */
> >                    + (1+2+5+(12*acpi_cpus))                  /* NTFY */
> >                    + (6+2+1+(1*acpi_cpus))                   /* CPON */
> > -                  + (1+3+4)                                 /* Scope(PCI0) */
> > -                  + ((PCI_SLOT_MAX - 1) * ACPI_PCIHP_SIZEOF)        /* slots */
> > -                  + (1+2+5+(12*(PCI_SLOT_MAX - 1))));          /* PCNT */
> > -    uint8_t *ssdt = acpi_data_push(table_data, length);
> > -    uint8_t *ssdt_ptr = ssdt;
> > +                 );
> > +    uint8_t *ssdt;
> > +    uint8_t *ssdt_ptr;
> > +    GArray *devices = build_alloc_array();
> > +    GArray *notify = build_alloc_array();
> > +
> > +    /* TODO: rewrite this function using glib */
> > +    build_append_device_buses(devices, guest_info);
> > +    length += devices->len;
> > +    build_append_notify_buses(notify, guest_info);
> > +    length += notify->len;
> > +
> > +    ssdt = acpi_data_push(table_data, length);
> > +    ssdt_ptr = ssdt;
> >  
> >      /* Copy header and encode fwcfg values in the S3_ / S4_ / S5_ packages */
> >      memcpy(ssdt_ptr, ssdp_misc_aml, sizeof(ssdp_misc_aml));
> > @@ -389,7 +717,7 @@ build_ssdt(GArray *table_data, GArray *linker,
> >  
> >      /* build "Method(NTFY, 2) {If (LEqual(Arg0, 0x00)) {Notify(CP00, Arg1)} ...}" */
> >      /* Arg0 = Processor ID = APIC ID */
> > -    ssdt_ptr = build_notify(ssdt_ptr, "NTFY", 0, acpi_cpus, "CP00", 2);
> > +    ssdt_ptr = build_notify(ssdt_ptr, "NTFY", 0, acpi_cpus, "CP00", 2, 1);
> >  
> >      /* build "Name(CPON, Package() { One, One, ..., Zero, Zero, ... })" */
> >      *(ssdt_ptr++) = 0x08; /* NameOp */
> > @@ -411,15 +739,10 @@ build_ssdt(GArray *table_data, GArray *linker,
> >      *(ssdt_ptr++) = 'I';
> >      *(ssdt_ptr++) = '0';
> >  
> > -    /* build Device object for each slot */
> > -    for (i = 1; i < PCI_SLOT_MAX; i++) {
> > -        bool eject = test_bit(i, guest_info->slot_hotplug_enable);
> > -        memcpy(ssdt_ptr, ACPI_PCIHP_AML, ACPI_PCIHP_SIZEOF);
> > -        patch_pcihp(i, ssdt_ptr, eject);
> > -        ssdt_ptr += ACPI_PCIHP_SIZEOF;
> > -    }
> > -
> > -    ssdt_ptr = build_notify(ssdt_ptr, "PCNT", 1, PCI_SLOT_MAX, "S00_", 1);
> > +    memcpy(ssdt_ptr, devices->data, devices->len);
> > +    ssdt_ptr += devices->len;
> > +    memcpy(ssdt_ptr, notify->data, notify->len);
> > +    ssdt_ptr += notify->len;
> >  
> >      build_header(linker, table_data,
> >                   (void*)ssdt, ACPI_SSDT_SIGNATURE, ssdt_ptr - ssdt, 1);
> > diff --git a/hw/i386/Makefile.objs b/hw/i386/Makefile.objs
> > index 2ab2572..c2e74e9 100644
> > --- a/hw/i386/Makefile.objs
> > +++ b/hw/i386/Makefile.objs
> > @@ -6,7 +6,7 @@ obj-$(CONFIG_XEN) += xen_domainbuild.o xen_machine_pv.o
> >  obj-y += kvmvapic.o
> >  obj-y += acpi-build.o
> >  obj-y += bios-linker-loader.o
> > -hw/i386/acpi-build.o: hw/i386/acpi-build.c hw/i386/acpi-dsdt.hex hw/i386/ssdt-proc.hex hw/i386/ssdt-pcihp.hex hw/i386/ssdt-misc.hex hw/i386/q35-acpi-dsdt.hex
> > +hw/i386/acpi-build.o: hw/i386/acpi-build.c hw/i386/acpi-dsdt.hex hw/i386/ssdt-proc.hex hw/i386/ssdt-misc.hex hw/i386/q35-acpi-dsdt.hex
> >  hw/i386/pc_piix.o: hw/i386/pc_piix.c hw/i386/acpi-dsdt.hex
> >  hw/i386/pc_q35.o: hw/i386/pc_q35.c hw/i386/q35-acpi-dsdt.hex
> >  
> > diff --git a/hw/i386/ssdt-pcihp.hex.generated b/hw/i386/ssdt-pcihp.hex.generated
> > deleted file mode 100644
> > index 0d32a27..0000000
> > --- a/hw/i386/ssdt-pcihp.hex.generated
> > +++ /dev/null
> > @@ -1,108 +0,0 @@
> > -static unsigned char ssdt_pcihp_name[] = {
> > -0x33
> > -};
> > -static unsigned char ssdt_pcihp_adr[] = {
> > -0x44
> > -};
> > -static unsigned char ssdt_pcihp_end[] = {
> > -0x58
> > -};
> > -static unsigned char ssdp_pcihp_aml[] = {
> > -0x53,
> > -0x53,
> > -0x44,
> > -0x54,
> > -0x58,
> > -0x0,
> > -0x0,
> > -0x0,
> > -0x1,
> > -0x77,
> > -0x42,
> > -0x58,
> > -0x50,
> > -0x43,
> > -0x0,
> > -0x0,
> > -0x42,
> > -0x58,
> > -0x53,
> > -0x53,
> > -0x44,
> > -0x54,
> > -0x50,
> > -0x43,
> > -0x1,
> > -0x0,
> > -0x0,
> > -0x0,
> > -0x49,
> > -0x4e,
> > -0x54,
> > -0x4c,
> > -0x28,
> > -0x5,
> > -0x10,
> > -0x20,
> > -0x10,
> > -0x33,
> > -0x5c,
> > -0x2e,
> > -0x5f,
> > -0x53,
> > -0x42,
> > -0x5f,
> > -0x50,
> > -0x43,
> > -0x49,
> > -0x30,
> > -0x5b,
> > -0x82,
> > -0x26,
> > -0x53,
> > -0x41,
> > -0x41,
> > -0x5f,
> > -0x8,
> > -0x5f,
> > -0x53,
> > -0x55,
> > -0x4e,
> > -0xa,
> > -0xaa,
> > -0x8,
> > -0x5f,
> > -0x41,
> > -0x44,
> > -0x52,
> > -0xc,
> > -0x0,
> > -0x0,
> > -0xaa,
> > -0x0,
> > -0x14,
> > -0xf,
> > -0x5f,
> > -0x45,
> > -0x4a,
> > -0x30,
> > -0x1,
> > -0xa4,
> > -0x50,
> > -0x43,
> > -0x45,
> > -0x4a,
> > -0x5f,
> > -0x53,
> > -0x55,
> > -0x4e
> > -};
> > -static unsigned char ssdt_pcihp_start[] = {
> > -0x30
> > -};
> > -static unsigned char ssdt_pcihp_id[] = {
> > -0x3d
> > -};
> > -static unsigned char ssdt_pcihp_ej0[] = {
> > -0x4a
> > -};
> > -- 
> > MST
Gerd Hoffmann June 11, 2013, 5:33 a.m. UTC | #22
Hi,

>        * Use of glib's GArray makes it much easier to build
>          up tables in code without need for iasl and code patching

Nice.

> Design:
>     - each bus gets assigned a number 0-255
>     - generated ACPI code writes this number
>       to a new BSEL register, then uses existing
>       UP/DOWN registers to probe slot status;
>       to eject, write number to BSEL register,
>       then slot into existing EJ
> 
>     This is to address the ACPI spec requirement to
>     avoid config cycle access to any bus except PCI roots.
> 
> Portability:
>     - Non x86 (or any Linux) platforms don't need any of this code.
>       They can keep happily using SHPC the way
>       they always did.

Hmm.  Is is possible to write a SHPC driver in AML?  I think it would be
alot better to have one guest/host interface for pci bridge hotplug
instead of two.

cheers,
  Gerd
Michael S. Tsirkin June 11, 2013, 6:49 a.m. UTC | #23
On Mon, Jun 10, 2013 at 08:25:15PM -0500, Anthony Liguori wrote:
> > I do understand your desire to pass this stuff as parameters, but I
> > really don't see it as feasible.  I'm hoping that if you can write up
> > some examples with specifics you'll either enlighten me or you'll see
> > the difficulties I'm trying to describe.
> 
> Is your basic argument that generating these tables is hard, QEMU can
> and should just hard code them and change them whenever it feels like
> without involving SeaBIOS?

Basically I think it makes sense to listen to people
that write drivers for PV hardware that QEMU exposes.
bios is one such example.

If people that are supposed to use our interfaces
ask for a specific interface, I think we should listen.
If people come and say that building acpi dynamically is hard
for them and they want QEMU to prepare templates for acpi -
and there's overwhelming concensus on this -
QEMU should stop going "tough luck we know what's best for you".

> If that's what it boils down to, I'll look more closely at it.

So please look at the original patchset
[PATCH RFC 00/13] qemu: generate acpi tables for the guest

and let's move the discussion of it beyond the meta-argument
"QEMU does not need to generate any ACPI tables" 
and to the specific interface that we can use
to make life easier for the bios.

Specifically several options were discussed, on and off-line for passing
per-formatted tables to firmware:

1. use existing fw cfg interface from bios,
   qemu side make it work well with cross version migration.
2. same as 1 but find some existing hardware that is similar
   to fw cfg and emulate it. For example build a directory
   structure in some flash.
3. make qemu put tables into guest memory.
   bios needs to reserve this memory.
4. same as 3 but bios allocates memory and
   passes address to qemu.

Kevin suggested 1. This is what my patchset implements.
But I think bios could live with any of
the others, as long as it's a generic interface so
we can stop changing bios code all the time.
Michael S. Tsirkin June 11, 2013, 6:55 a.m. UTC | #24
On Tue, Jun 11, 2013 at 07:33:02AM +0200, Gerd Hoffmann wrote:
>   Hi,
> 
> >        * Use of glib's GArray makes it much easier to build
> >          up tables in code without need for iasl and code patching
> 
> Nice.
> 
> > Design:
> >     - each bus gets assigned a number 0-255
> >     - generated ACPI code writes this number
> >       to a new BSEL register, then uses existing
> >       UP/DOWN registers to probe slot status;
> >       to eject, write number to BSEL register,
> >       then slot into existing EJ
> > 
> >     This is to address the ACPI spec requirement to
> >     avoid config cycle access to any bus except PCI roots.
> > 
> > Portability:
> >     - Non x86 (or any Linux) platforms don't need any of this code.
> >       They can keep happily using SHPC the way
> >       they always did.
> 
> Hmm.  Is is possible to write a SHPC driver in AML?  I think it would be
> alot better to have one guest/host interface for pci bridge hotplug
> instead of two.
> 
> cheers,
>   Gerd

No, it's not possible, SHPC is not designed to be used from ACPI.

Two reasons off the top of my head, there are likely others:

1. SHPC uses regular PCI interrupts to signal events. It does not signal
   GFE and SCI.

2. SHPC uses config accesses to get information from device.
   ACPI does not allow config access anywhere except the root bus from ACPI
   (This requirement is designed to give the OS freedom
   to reconfigure PCI in an arbitrary way).
Michael S. Tsirkin June 11, 2013, 7:35 a.m. UTC | #25
On Mon, Jun 10, 2013 at 08:03:46PM -0500, Anthony Liguori wrote:
> >> It is not "supported" by QEMU.
> >
> > No, but I've always thought that QEMU was happy to have alternative
> > firmware projects.
> 
> We are and we're happy to accept patches to enable things even if its
> proprietary.  But that's all assuming we're improving hardware
> emulation.
> 
> What we're talking about here is doing something that's very un-hardware like.

Two points
1. You never explained what you mean by un-hardware like.

   Currently bios is in a ROM device, and it has a
   template for ACPI tables together with it.
   This simply moves the tables to a separate ROM
   device (FW CFG), and generalizes the template using
   the linker interface.
   One ROM is hardware-like but two is un-hardware like?

   ACPI tables are static so it's likely lots of
   hardware has at least some of them pre-formatted in flash,
   then tweak some things like SRAT a bit.

2. There's no hardware-like alternative.

   A current alternative is exposing lots of PV interfaces to bios,
   bios builds up ACPI tables based on them.
   All of this code is in practice PC only,
   if we ever attempt to generalize we'll see how
   the interfaces are not a good match.
   They are still PV, poorly documented and it's very hard for
   bios not to make any assumptions about them.
   When it does, result is a very obscure breakage.

   Yes, building tables in bios means
   QEMU code is simpler but we double the
   amount of interfaces that we need to get right:
   QEMU-BIOS + BIOS-OS

   Yes it's "not QEMU code" but it's seen as
   a single package by users and it does not really
   matter - excellent hardware with lousy firmware is
   useless.
Gerd Hoffmann June 11, 2013, 7:42 a.m. UTC | #26
Hi,

>>> Portability:
>>>     - Non x86 (or any Linux) platforms don't need any of this code.
>>>       They can keep happily using SHPC the way
>>>       they always did.
>>
>> Hmm.  Is is possible to write a SHPC driver in AML?  I think it would be
>> alot better to have one guest/host interface for pci bridge hotplug
>> instead of two.
>>
>> cheers,
>>   Gerd
> 
> No, it's not possible, SHPC is not designed to be used from ACPI.
> 
> Two reasons off the top of my head, there are likely others:
> 
> 1. SHPC uses regular PCI interrupts to signal events. It does not signal
>    GFE and SCI.
> 
> 2. SHPC uses config accesses to get information from device.
>    ACPI does not allow config access anywhere except the root bus from ACPI
>    (This requirement is designed to give the OS freedom
>    to reconfigure PCI in an arbitrary way).

OK, so it's designed for OSes to have native SHPC support.  Linux has that?

Quick googling found me Windows Vista+ has it too, correct?  So that
leaves Win2k + WinXP versions.  Older Windows versions do not support
pci hotplug at all.  Win2k is EOL already.  WinXP will follow soon.

More users?

/me wonders whenever it is worth hopping through the loops needed to
support ACPI-based hotplug of devices behind bridges in the first place.

cheers,
  Gerd
Michael S. Tsirkin June 11, 2013, 7:53 a.m. UTC | #27
On Tue, Jun 11, 2013 at 09:42:29AM +0200, Gerd Hoffmann wrote:
>   Hi,
> 
> >>> Portability:
> >>>     - Non x86 (or any Linux) platforms don't need any of this code.
> >>>       They can keep happily using SHPC the way
> >>>       they always did.
> >>
> >> Hmm.  Is is possible to write a SHPC driver in AML?  I think it would be
> >> alot better to have one guest/host interface for pci bridge hotplug
> >> instead of two.
> >>
> >> cheers,
> >>   Gerd
> > 
> > No, it's not possible, SHPC is not designed to be used from ACPI.
> > 
> > Two reasons off the top of my head, there are likely others:
> > 
> > 1. SHPC uses regular PCI interrupts to signal events. It does not signal
> >    GFE and SCI.
> > 
> > 2. SHPC uses config accesses to get information from device.
> >    ACPI does not allow config access anywhere except the root bus from ACPI
> >    (This requirement is designed to give the OS freedom
> >    to reconfigure PCI in an arbitrary way).
> 
> OK, so it's designed for OSes to have native SHPC support.  Linux has that?

Yes.

> Quick googling found me Windows Vista+ has it too, correct?  So that
> leaves Win2k + WinXP versions.  Older Windows versions do not support
> pci hotplug at all.  Win2k is EOL already.  WinXP will follow soon.
> 
> More users?

googling lead you astray.
No windows version supports SHPC.

> /me wonders whenever it is worth hopping through the loops needed to
> support ACPI-based hotplug of devices behind bridges in the first place.
> 
> cheers,
>   Gerd

Yes and not just because of windows guests.
ACPI spec is also very explicit that native hotplug is an optional
feature. Test suites such as WHQL are known to test spec compliance.
Gerd Hoffmann June 11, 2013, 8 a.m. UTC | #28
Hi,

> Yes and not just because of windows guests.
> ACPI spec is also very explicit that native hotplug is an optional
> feature. Test suites such as WHQL are known to test spec compliance.

/me looks a bit surprised.

This pretty much implies that any shpc bridge needs a second interface
to the hotplug functionality which can be driven via ACPI.  Or the
firmware somehow handles this using smm (not sure this works for the IRQ
though).

Do you know how this is handled by real hardware?

cheers,
  Gerd
Michael S. Tsirkin June 11, 2013, 8:18 a.m. UTC | #29
On Tue, Jun 11, 2013 at 10:00:36AM +0200, Gerd Hoffmann wrote:
>   Hi,
> 
> > Yes and not just because of windows guests.
> > ACPI spec is also very explicit that native hotplug is an optional
> > feature. Test suites such as WHQL are known to test spec compliance.
> 
> /me looks a bit surprised.
> 
> This pretty much implies that any shpc bridge needs a second interface
> to the hotplug functionality which can be driven via ACPI.  Or the
> firmware somehow handles this using smm (not sure this works for the IRQ
> though).
> 
> Do you know how this is handled by real hardware?
> 
> cheers,
>   Gerd

SHPC is not very widely deployed on a PC.

Since most hardware vendors do care about windows support,
the only way is a separate interface that is driven via ACPI.
You then need an ACPI specific register to switch to standard SHPC.
The SHPC spec even tells you as much.
Michael S. Tsirkin June 11, 2013, 8:27 a.m. UTC | #30
On Tue, Jun 11, 2013 at 11:18:13AM +0300, Michael S. Tsirkin wrote:
> On Tue, Jun 11, 2013 at 10:00:36AM +0200, Gerd Hoffmann wrote:
> >   Hi,
> > 
> > > Yes and not just because of windows guests.
> > > ACPI spec is also very explicit that native hotplug is an optional
> > > feature. Test suites such as WHQL are known to test spec compliance.
> > 
> > /me looks a bit surprised.
> > 
> > This pretty much implies that any shpc bridge needs a second interface
> > to the hotplug functionality which can be driven via ACPI.  Or the
> > firmware somehow handles this using smm (not sure this works for the IRQ
> > though).
> > 
> > Do you know how this is handled by real hardware?
> > 
> > cheers,
> >   Gerd
> 
> SHPC is not very widely deployed on a PC.
> 
> Since most hardware vendors do care about windows support,
> the only way is a separate interface that is driven via ACPI.
> You then need an ACPI specific register to switch to standard SHPC.
> The SHPC spec even tells you as much.

Just to give another example, interface in this patch
scales trivially to multi-root configurations,
if we ever want to support them.

> -- 
> MST
David Woodhouse June 11, 2013, 2:04 p.m. UTC | #31
On Mon, 2013-06-10 at 20:03 -0500, Anthony Liguori wrote:
> 
> > I agree that the FAT driver is not 'free software' and I agree that is
> > a problem for usage with free software projects, such as QEMU. This is
> > a big deal, but unfortunately, as an Intel employee, I think I've done
> > as much as I can to address this.
> >
> > It couldn't hurt if more people that actually care about this spoke up
> > on edk2-devel about the issue, or perhaps within a UEFI working group.
> > Because, I know that they've stopped listening to me about it.
> 
> Is this useful?  I can try to make noise.  I assume since folks like
> you who have much more credibility and familiarity in this space have
> given up, it's a lost cause.

I don't think it's a lost cause. It's not technically hard to implement
a FAT driver. And the *bureaucracy* shouldn't be hard either.
Distributions ship *lots* of FAT implementations already. The alleged
patents on it, if there even *are* any left, are fraudulent. And don't
apply in the Free World where I live either, to my knowledge.

If we can ship FAT support in the Linux kernel, in mtools, and in
various other things including BSD kernels, then we can damn well ship a
UEFI module which implements FAT support. Saying that we can't do the
latter, while simultaneously continuing to do the former, makes
absolutely no sense.
David Woodhouse June 11, 2013, 2:11 p.m. UTC | #32
On Mon, 2013-06-10 at 19:11 -0500, Anthony Liguori wrote:
> If we did this right in QEMU, we'd have to introduce a QOM property
> anyway 

... and then we'd have to update each firmware implementation to know
about this new property, and the how it translates to the RESET_REG
field in the DSDT, etc.

So we *still* end up having to update firmwares to keep up with qemu,
much more than we would if we'd generated the tables on the qemu side.
Paolo Bonzini June 13, 2013, 11:02 p.m. UTC | #33
Il 10/06/2013 21:03, Anthony Liguori ha scritto:
>>> I'm not really convinced that
>>> QEMU<->firmware is a GPL boundary because of how tightly the two are
>>> linked.
>>
>> Where has 'linked' in terms of the GPL ever been anything other than
>> actual executable linking?
> 
> I should not have even brought this up as it's not worth debating.  If
> you're curious, http://www.gnu.org/licenses/gpl-faq.html#MereAggregation

With the usual IANAL care, I think QOM would be much much more of a
legal grey area that passing ACPI tables.

If you pass ACPI tables, the ACPI tables are clearly part of QEMU, and
are almost as clearly "just data" for the BIOS.

But if you use QOM, you may start wondering if "the semantics of the
communication are intimate enough, exchanging complex internal data
structures".  Probably not, as it is a generic interface and there could
be in principle other consumers than the firmware, but still complex
enough that raising the issue is not moot.

Paolo
Paolo Bonzini June 13, 2013, 11:05 p.m. UTC | #34
Il 11/06/2013 03:35, Michael S. Tsirkin ha scritto:
> Two points
> 1. You never explained what you mean by un-hardware like.
> 
>    Currently bios is in a ROM device, and it has a
>    template for ACPI tables together with it.
>    This simply moves the tables to a separate ROM
>    device (FW CFG), and generalizes the template using
>    the linker interface.
>    One ROM is hardware-like but two is un-hardware like?
> 
>    ACPI tables are static so it's likely lots of
>    hardware has at least some of them pre-formatted in flash,
>    then tweak some things like SRAT a bit.

Also having a "bootstrap processor" was certainly not unheard of some
decades ago.  Right now we get all sort of SMM hacks instead of adding
more processors, but it's certainly not un-hardware like.

Maybe we should just have a bytecode interpreter and write the ACPI
generator in that language. :)

Paolo
Laszlo Ersek June 14, 2013, 12:26 a.m. UTC | #35
On 06/14/13 01:02, Paolo Bonzini wrote:
> Il 10/06/2013 21:03, Anthony Liguori ha scritto:
>>>> I'm not really convinced that
>>>> QEMU<->firmware is a GPL boundary because of how tightly the two are
>>>> linked.
>>>
>>> Where has 'linked' in terms of the GPL ever been anything other than
>>> actual executable linking?
>>
>> I should not have even brought this up as it's not worth debating.  If
>> you're curious, http://www.gnu.org/licenses/gpl-faq.html#MereAggregation
> 
> With the usual IANAL care, I think QOM would be much much more of a
> legal grey area that passing ACPI tables.
> 
> If you pass ACPI tables, the ACPI tables are clearly part of QEMU, and
> are almost as clearly "just data" for the BIOS.
> 
> But if you use QOM, you may start wondering if "the semantics of the
> communication are intimate enough, exchanging complex internal data
> structures".  Probably not, as it is a generic interface and there could
> be in principle other consumers than the firmware, but still complex
> enough that raising the issue is not moot.

Basing the decision about

    derivative or not

on

    internal

or

    complex enough

; well I find that unusable.

First, complexity: web services can use very complex XML schemas, and
that clearly doesn't make the server derivative of the client, or vice
versa.

Second, internality: this attribute can be wiped out simply by writing
documentation for the data structure (which should be done *anyway*).
Once it is documented, it is not internal any longer. However existence
of documentation for a wire format between A and B should have
absolutely no say in whether codebase A is derivative of codebase B.

IANAL of course but I find the FSF's argument biased.

(Of course I'm also not buying that linking against a library makes the
client application (its own source code, or its dynamically linked
binary) derivative of the library. If there are two libraries
(especially when released under different licenses) implementing the
same API, which one is the application a derivative of?)

Laszlo
Anthony Liguori June 14, 2013, 12:59 a.m. UTC | #36
Paolo Bonzini <pbonzini@redhat.com> writes:

> Il 11/06/2013 03:35, Michael S. Tsirkin ha scritto:
>> Two points
>> 1. You never explained what you mean by un-hardware like.
>> 
>>    Currently bios is in a ROM device, and it has a
>>    template for ACPI tables together with it.
>>    This simply moves the tables to a separate ROM
>>    device (FW CFG), and generalizes the template using
>>    the linker interface.
>>    One ROM is hardware-like but two is un-hardware like?
>> 
>>    ACPI tables are static so it's likely lots of
>>    hardware has at least some of them pre-formatted in flash,
>>    then tweak some things like SRAT a bit.
>
> Also having a "bootstrap processor" was certainly not unheard of some
> decades ago.  Right now we get all sort of SMM hacks instead of adding
> more processors, but it's certainly not un-hardware like.

It's still not unheard of.  This is how power systems work still.

However, with PCs, the ACPI tables are generated by/included in the
firmware.  There's no question about that.

>
> Maybe we should just have a bytecode interpreter and write the ACPI
> generator in that language. :)

Indeed, we can even using an existing bytecode like the x86 instruction
set and use this VM called KVM to execute it.  I hear there are even C
compilers for this bytecode ;-)

Regards,

Anthony Liguori

> Paolo
Peter Stuge June 14, 2013, 1:23 a.m. UTC | #37
Anthony Liguori wrote:
> However, with PCs, the ACPI tables are generated by/included in the
> firmware.  There's no question about that.

I think the key point is that the firmware is developed and delivered
by the hardware vendor, and not by an independent source.

The firmware is intimately tied to the hardware, and is very much an
integral part of all hardware, to the point where the firmware is one
of the most prominent downloads made available by all hardware
vendors.

Don't make the mistake of thinking that firmware for a virtual
machine has less to do with the machine just because it's virtual.


//Peter
Laszlo Ersek June 16, 2013, 10 a.m. UTC | #38
On 06/14/13 02:26, Laszlo Ersek wrote:
> On 06/14/13 01:02, Paolo Bonzini wrote:
>> Il 10/06/2013 21:03, Anthony Liguori ha scritto:
>>>>> I'm not really convinced that
>>>>> QEMU<->firmware is a GPL boundary because of how tightly the two are
>>>>> linked.
>>>>
>>>> Where has 'linked' in terms of the GPL ever been anything other than
>>>> actual executable linking?
>>>
>>> I should not have even brought this up as it's not worth debating.  If
>>> you're curious, http://www.gnu.org/licenses/gpl-faq.html#MereAggregation
>>
>> With the usual IANAL care, I think QOM would be much much more of a
>> legal grey area that passing ACPI tables.
>>
>> If you pass ACPI tables, the ACPI tables are clearly part of QEMU, and
>> are almost as clearly "just data" for the BIOS.
>>
>> But if you use QOM, you may start wondering if "the semantics of the
>> communication are intimate enough, exchanging complex internal data
>> structures".  Probably not, as it is a generic interface and there could
>> be in principle other consumers than the firmware, but still complex
>> enough that raising the issue is not moot.
> 
> Basing the decision about
> 
>     derivative or not
> 
> on
> 
>     internal
> 
> or
> 
>     complex enough
> 
> ; well I find that unusable.
> 
> First, complexity: web services can use very complex XML schemas, and
> that clearly doesn't make the server derivative of the client, or vice
> versa.
> 
> Second, internality: this attribute can be wiped out simply by writing
> documentation for the data structure (which should be done *anyway*).
> Once it is documented, it is not internal any longer. However existence
> of documentation for a wire format between A and B should have
> absolutely no say in whether codebase A is derivative of codebase B.
> 
> IANAL of course but I find the FSF's argument biased.
> 
> (Of course I'm also not buying that linking against a library makes the
> client application (its own source code, or its dynamically linked
> binary) derivative of the library. If there are two libraries
> (especially when released under different licenses) implementing the
> same API, which one is the application a derivative of?)

This is my personal, private opinion, of course, which is independent of
what my employer holds.

Laszlo
diff mbox

Patch

diff --git a/docs/specs/acpi_pci_hotplug.txt b/docs/specs/acpi_pci_hotplug.txt
index a839434..951b99a 100644
--- a/docs/specs/acpi_pci_hotplug.txt
+++ b/docs/specs/acpi_pci_hotplug.txt
@@ -43,3 +43,11 @@  PCI removability status (IO port 0xae0c-0xae0f, 4-byte access):
 
 Used by ACPI BIOS _RMV method to indicate removability status to OS. One
 bit per slot.  Read-only
+
+PCI bus selector (IO port 0xae10-0xae13, 4-byte access):
+-----------------------------------------------
+
+Used by ACPI BIOS methods to select the current PCI bus.
+When written, makes all the other PCI registers (0xae00 - 0xae0f)
+to refer to the appropriate bus.
+0 selects PCI root bus (default).
diff --git a/include/hw/i386/pc.h b/include/hw/i386/pc.h
index f8d0871..66ec787 100644
--- a/include/hw/i386/pc.h
+++ b/include/hw/i386/pc.h
@@ -38,7 +38,9 @@  struct PcGuestInfo {
     bool s3_disabled;
     bool s4_disabled;
     uint8_t s4_val;
-    DECLARE_BITMAP(slot_hotplug_enable, PCI_SLOT_MAX);
+#define GUEST_INFO_MAX_HOTPLUG_BUS 256
+    PCIBus *hotplug_buses[GUEST_INFO_MAX_HOTPLUG_BUS];
+    DECLARE_BITMAP(hotplug_enable[GUEST_INFO_MAX_HOTPLUG_BUS], PCI_SLOT_MAX);
     uint16_t sci_int;
     uint8_t acpi_enable_cmd;
     uint8_t acpi_disable_cmd;
diff --git a/hw/i386/acpi-dsdt.dsl b/hw/i386/acpi-dsdt.dsl
index 90efce0..dc05668 100644
--- a/hw/i386/acpi-dsdt.dsl
+++ b/hw/i386/acpi-dsdt.dsl
@@ -133,11 +133,18 @@  DefinitionBlock (
             B0EJ, 32,
         }
 
+        OperationRegion(SEL, SystemIO, 0xae10, 0x04)
+        Field(SEL, DWordAcc, NoLock, WriteAsZeros) {
+            BSEL, 32,
+        }
+
         /* Methods called by bulk generated PCI devices below */
+        Name(BNUM, 0x00)
 
         /* Methods called by hotplug devices */
-        Method(PCEJ, 1, NotSerialized) {
+        Method(PCEJ, 2, NotSerialized) {
             // _EJ0 method - eject callback
+            Store(Arg1, BSEL)
             Store(ShiftLeft(1, Arg0), B0EJ)
             Return (0x0)
         }
@@ -147,16 +154,25 @@  DefinitionBlock (
 
         /* PCI hotplug notify method */
         Method(PCNF, 0) {
-            // Local0 = iterator
+            // Local0 = bridge selector
             Store(Zero, Local0)
-            While (LLess(Local0, 31)) {
-                Increment(Local0)
-                If (And(PCIU, ShiftLeft(1, Local0))) {
-                    PCNT(Local0, 1)
-                }
-                If (And(PCID, ShiftLeft(1, Local0))) {
-                    PCNT(Local0, 3)
-                }
+            While (LLess(Local0, 255)) {
+                    Store(Local0, BSEL)
+                    // Local1 = slot iterator
+                    Store(Zero, Local1)
+
+                    Store(PCIU, Local2)
+                    Store(PCID, Local3)
+                    While (LLess(Local1, 31)) {
+                        Increment(Local1)
+                        If (And(Local2, ShiftLeft(1, Local1))) {
+                            PCNT(Add(Local1, Multiply(Local0, 32)), 1)
+                        }
+                        If (And(Local3, ShiftLeft(1, Local1))) {
+                            PCNT(Add(Local1, Multiply(Local0, 32)), 3)
+                        }
+                    }
+                    Increment(Local0)
             }
         }
     }
diff --git a/hw/i386/ssdt-pcihp.dsl b/hw/i386/ssdt-pcihp.dsl
deleted file mode 100644
index d29a5b9..0000000
--- a/hw/i386/ssdt-pcihp.dsl
+++ /dev/null
@@ -1,51 +0,0 @@ 
-/*
- * This program is free software; you can redistribute it and/or modify
- * it under the terms of the GNU General Public License as published by
- * the Free Software Foundation; either version 2 of the License, or
- * (at your option) any later version.
-
- * This program is distributed in the hope that it will be useful,
- * but WITHOUT ANY WARRANTY; without even the implied warranty of
- * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
- * GNU General Public License for more details.
-
- * You should have received a copy of the GNU General Public License along
- * with this program; if not, see <http://www.gnu.org/licenses/>.
- */
-
-ACPI_EXTRACT_ALL_CODE ssdp_pcihp_aml
-
-DefinitionBlock ("ssdt-pcihp.aml", "SSDT", 0x01, "BXPC", "BXSSDTPCIHP", 0x1)
-{
-
-/****************************************************************
- * PCI hotplug
- ****************************************************************/
-
-    /* Objects supplied by DSDT */
-    External(\_SB.PCI0, DeviceObj)
-    External(\_SB.PCI0.PCEJ, MethodObj)
-
-    Scope(\_SB.PCI0) {
-
-        /* Bulk generated PCI hotplug devices */
-        ACPI_EXTRACT_DEVICE_START ssdt_pcihp_start
-        ACPI_EXTRACT_DEVICE_END ssdt_pcihp_end
-        ACPI_EXTRACT_DEVICE_STRING ssdt_pcihp_name
-
-        // Method _EJ0 can be patched by BIOS to EJ0_
-        // at runtime, if the slot is detected to not support hotplug.
-        // Extract the offset of the address dword and the
-        // _EJ0 name to allow this patching.
-        Device(SAA) {
-            ACPI_EXTRACT_NAME_BYTE_CONST ssdt_pcihp_id
-            Name(_SUN, 0xAA)
-            ACPI_EXTRACT_NAME_DWORD_CONST ssdt_pcihp_adr
-            Name(_ADR, 0xAA0000)
-            ACPI_EXTRACT_METHOD_STRING ssdt_pcihp_ej0
-            Method(_EJ0, 1) {
-                Return (PCEJ(_SUN))
-            }
-        }
-    }
-}
diff --git a/hw/acpi/piix4.c b/hw/acpi/piix4.c
index 19a26f5..10106fd 100644
--- a/hw/acpi/piix4.c
+++ b/hw/acpi/piix4.c
@@ -29,6 +29,7 @@ 
 #include "exec/ioport.h"
 #include "hw/nvram/fw_cfg.h"
 #include "exec/address-spaces.h"
+#include "hw/pci/pci_bus.h"
 
 //#define DEBUG
 
@@ -42,11 +43,12 @@ 
 #define GPE_LEN 4
 
 #define PCI_HOTPLUG_ADDR 0xae00
-#define PCI_HOTPLUG_SIZE 0x000f
+#define PCI_HOTPLUG_SIZE 0x0014
 #define PCI_UP_BASE 0xae00
 #define PCI_DOWN_BASE 0xae04
 #define PCI_EJ_BASE 0xae08
 #define PCI_RMV_BASE 0xae0c
+#define PCI_SEL_BASE 0xae10
 
 #define PIIX4_PROC_BASE 0xaf00
 #define PIIX4_PROC_LEN 32
@@ -57,6 +59,8 @@ 
 struct pci_status {
     uint32_t up; /* deprecated, maintained for migration compatibility */
     uint32_t down;
+    uint32_t hotplug_enable;
+    uint32_t device_present;
 };
 
 typedef struct CPUStatus {
@@ -84,9 +88,8 @@  typedef struct PIIX4PMState {
     Notifier powerdown_notifier;
 
     /* for pci hotplug */
-    struct pci_status pci0_status;
-    uint32_t pci0_hotplug_enable;
-    uint32_t pci0_slot_device_present;
+    struct pci_status pci_status[GUEST_INFO_MAX_HOTPLUG_BUS];
+    uint32_t hotplug_select;
 
     uint8_t disable_s3;
     uint8_t disable_s4;
@@ -98,6 +101,17 @@  typedef struct PIIX4PMState {
     PcGuestInfo *guest_info;
 } PIIX4PMState;
 
+static int piix4_find_hotplug_bus(PcGuestInfo *guest_info, PCIBus *bus)
+{
+    int i;
+    for (i = 0; i < GUEST_INFO_MAX_HOTPLUG_BUS; ++i) {
+        if (guest_info->hotplug_buses[i] == bus) {
+            return i;
+        }
+    }
+    return -1;
+}
+
 static void piix4_acpi_system_hot_add_init(MemoryRegion *parent,
                                            PCIBus *bus, PIIX4PMState *s);
 
@@ -183,6 +197,8 @@  static void pm_write_config(PCIDevice *d,
 
 static void vmstate_pci_status_pre_save(void *opaque)
 {
+#if 0
+    /* TODO */
     struct pci_status *pci0_status = opaque;
     PIIX4PMState *s = container_of(pci0_status, PIIX4PMState, pci0_status);
 
@@ -190,6 +206,7 @@  static void vmstate_pci_status_pre_save(void *opaque)
      * to a version that still does... of course these might get lost
      * by an old buggy implementation, but we try. */
     pci0_status->up = s->pci0_slot_device_present & s->pci0_hotplug_enable;
+#endif
 }
 
 static int vmstate_acpi_post_load(void *opaque, int version_id)
@@ -267,7 +284,10 @@  static int acpi_load_old(QEMUFile *f, void *opaque, int version_id)
         qemu_get_be16s(f, &temp);
     }
 
+#if 0
+    TODO
     ret = vmstate_load_state(f, &vmstate_pci_status, &s->pci0_status, 1);
+#endif
     return ret;
 }
 
@@ -293,21 +313,28 @@  static const VMStateDescription vmstate_acpi = {
         VMSTATE_TIMER(ar.tmr.timer, PIIX4PMState),
         VMSTATE_INT64(ar.tmr.overflow_time, PIIX4PMState),
         VMSTATE_STRUCT(ar.gpe, PIIX4PMState, 2, vmstate_gpe, ACPIGPE),
+#if 0
         VMSTATE_STRUCT(pci0_status, PIIX4PMState, 2, vmstate_pci_status,
                        struct pci_status),
+#endif
         VMSTATE_END_OF_LIST()
     }
 };
 
-static void acpi_piix_eject_slot(PIIX4PMState *s, unsigned slots)
+static void acpi_piix_eject_slot(PIIX4PMState *s, unsigned bsel, unsigned slots)
 {
     BusChild *kid, *next;
-    BusState *bus = qdev_get_parent_bus(&s->dev.qdev);
+    BusState *bus;
     int slot = ffs(slots) - 1;
     bool slot_free = true;
 
+    if (!s->guest_info->hotplug_buses[bsel]) {
+        return;
+    }
+    bus = &s->guest_info->hotplug_buses[bsel]->qbus;
+
     /* Mark request as complete */
-    s->pci0_status.down &= ~(1U << slot);
+    s->pci_status[bsel].down &= ~(1U << slot);
 
     QTAILQ_FOREACH_SAFE(kid, &bus->children, sibling, next) {
         DeviceState *qdev = kid->child;
@@ -322,23 +349,26 @@  static void acpi_piix_eject_slot(PIIX4PMState *s, unsigned slots)
         }
     }
     if (slot_free) {
-        s->pci0_slot_device_present &= ~(1U << slot);
+        s->pci_status[bsel].device_present &= ~(1U << slot);
     }
 }
 
-static void piix4_update_hotplug(PIIX4PMState *s)
+static void piix4_update_hotplug(PIIX4PMState *s, unsigned bsel)
 {
-    PCIDevice *dev = &s->dev;
-    BusState *bus = qdev_get_parent_bus(&dev->qdev);
     BusChild *kid, *next;
+    BusState *bus;
 
+    if (!s->guest_info->hotplug_buses[bsel]) {
+        return;
+    }
+    bus = &s->guest_info->hotplug_buses[bsel]->qbus;
     /* Execute any pending removes during reset */
-    while (s->pci0_status.down) {
-        acpi_piix_eject_slot(s, s->pci0_status.down);
+    while (s->pci_status[bsel].down) {
+        acpi_piix_eject_slot(s, bsel, s->pci_status[bsel].down);
     }
 
-    s->pci0_hotplug_enable = ~0;
-    s->pci0_slot_device_present = 0;
+    s->pci_status[bsel].hotplug_enable = ~0;
+    s->pci_status[bsel].device_present = 0;
 
     QTAILQ_FOREACH_SAFE(kid, &bus->children, sibling, next) {
         DeviceState *qdev = kid->child;
@@ -347,10 +377,10 @@  static void piix4_update_hotplug(PIIX4PMState *s)
         int slot = PCI_SLOT(pdev->devfn);
 
         if (pc->no_hotplug) {
-            s->pci0_hotplug_enable &= ~(1U << slot);
+            s->pci_status[bsel].hotplug_enable &= ~(1U << slot);
         }
 
-        s->pci0_slot_device_present |= (1U << slot);
+        s->pci_status[bsel].device_present |= (1U << slot);
     }
 }
 
@@ -358,6 +388,7 @@  static void piix4_reset(void *opaque)
 {
     PIIX4PMState *s = opaque;
     uint8_t *pci_conf = s->dev.config;
+    int i;
 
     pci_conf[0x58] = 0;
     pci_conf[0x59] = 0;
@@ -371,7 +402,9 @@  static void piix4_reset(void *opaque)
         /* Mark SMM as already inited (until KVM supports SMM). */
         pci_conf[0x5B] = 0x02;
     }
-    piix4_update_hotplug(s);
+    for (i = 0; i < GUEST_INFO_MAX_HOTPLUG_BUS; ++i) {
+        piix4_update_hotplug(s, i);
+    }
 }
 
 static void piix4_pm_powerdown_req(Notifier *n, void *opaque)
@@ -382,14 +415,20 @@  static void piix4_pm_powerdown_req(Notifier *n, void *opaque)
     acpi_pm1_evt_power_down(&s->ar);
 }
 
-static void piix4_update_guest_info(PIIX4PMState *s)
+static int piix4_device_hotplug(DeviceState *qdev, PCIDevice *dev,
+                                PCIHotplugState state);
+
+static void piix4_update_bus_guest_info(PCIBus *pci_bus, void *opaque)
 {
-    PCIDevice *dev = &s->dev;
-    BusState *bus = qdev_get_parent_bus(&dev->qdev);
+    PIIX4PMState *s = opaque;
+    PcGuestInfo *guest_info = s->guest_info;
     BusChild *kid, *next;
+    BusState *bus = &pci_bus->qbus;
 
-    memset(s->guest_info->slot_hotplug_enable, 0xff,
-           DIV_ROUND_UP(PCI_SLOT_MAX, BITS_PER_BYTE));
+    int i = piix4_find_hotplug_bus(guest_info, NULL);
+    assert(i >= 0);
+
+    guest_info->hotplug_buses[i] = pci_bus;
 
     QTAILQ_FOREACH_SAFE(kid, &bus->children, sibling, next) {
         DeviceState *qdev = kid->child;
@@ -398,9 +437,22 @@  static void piix4_update_guest_info(PIIX4PMState *s)
         int slot = PCI_SLOT(pdev->devfn);
 
         if (pc->no_hotplug) {
-            clear_bit(slot, s->guest_info->slot_hotplug_enable);
+            clear_bit(slot, guest_info->hotplug_enable[i]);
         }
     }
+
+    pci_bus_hotplug(pci_bus, piix4_device_hotplug, &s->dev.qdev);
+}
+
+static void piix4_update_guest_info(PIIX4PMState *s)
+{
+    PCIDevice *dev = &s->dev;
+
+    memset(s->guest_info->hotplug_enable, 0xff,
+           sizeof s->guest_info->hotplug_enable);
+
+    pci_for_each_bus(dev->bus, piix4_update_bus_guest_info,
+                     s);
 }
 
 static void piix4_pm_machine_ready(Notifier *n, void *opaque)
@@ -589,16 +641,22 @@  static uint64_t pci_read(void *opaque, hwaddr addr, unsigned int size)
 {
     PIIX4PMState *s = opaque;
     uint32_t val = 0;
+    int bsel = s->hotplug_select;
+
+    if (bsel > GUEST_INFO_MAX_HOTPLUG_BUS) {
+        return 0;
+    }
 
     switch (addr) {
     case PCI_UP_BASE - PCI_HOTPLUG_ADDR:
         /* Manufacture an "up" value to cause a device check on any hotplug
          * slot with a device.  Extra device checks are harmless. */
-        val = s->pci0_slot_device_present & s->pci0_hotplug_enable;
+        val = s->pci_status[bsel].device_present &
+            s->pci_status[bsel].hotplug_enable;
         PIIX4_DPRINTF("pci_up_read %x\n", val);
         break;
     case PCI_DOWN_BASE - PCI_HOTPLUG_ADDR:
-        val = s->pci0_status.down;
+        val = s->pci_status[bsel].down;
         PIIX4_DPRINTF("pci_down_read %x\n", val);
         break;
     case PCI_EJ_BASE - PCI_HOTPLUG_ADDR:
@@ -606,8 +664,10 @@  static uint64_t pci_read(void *opaque, hwaddr addr, unsigned int size)
         PIIX4_DPRINTF("pci_features_read %x\n", val);
         break;
     case PCI_RMV_BASE - PCI_HOTPLUG_ADDR:
-        val = s->pci0_hotplug_enable;
+        val = s->pci_status[bsel].hotplug_enable;
         break;
+    case PCI_SEL_BASE - PCI_HOTPLUG_ADDR:
+        val = s->hotplug_select;
     default:
         break;
     }
@@ -618,12 +678,18 @@  static uint64_t pci_read(void *opaque, hwaddr addr, unsigned int size)
 static void pci_write(void *opaque, hwaddr addr, uint64_t data,
                       unsigned int size)
 {
+    PIIX4PMState *s = opaque;
     switch (addr) {
     case PCI_EJ_BASE - PCI_HOTPLUG_ADDR:
-        acpi_piix_eject_slot(opaque, (uint32_t)data);
+        if (s->hotplug_select >= GUEST_INFO_MAX_HOTPLUG_BUS) {
+            break;
+        }
+        acpi_piix_eject_slot(s, s->hotplug_select, data);
         PIIX4_DPRINTF("pciej write %" HWADDR_PRIx " <== % " PRIu64 "\n",
                       addr, data);
         break;
+    case PCI_SEL_BASE - PCI_HOTPLUG_ADDR:
+        s->hotplug_select = data;
     default:
         break;
     }
@@ -706,9 +772,6 @@  static void piix4_init_cpu_status(CPUState *cpu, void *data)
     g->sts[id / 8] |= (1 << (id % 8));
 }
 
-static int piix4_device_hotplug(DeviceState *qdev, PCIDevice *dev,
-                                PCIHotplugState state);
-
 static void piix4_acpi_system_hot_add_init(MemoryRegion *parent,
                                            PCIBus *bus, PIIX4PMState *s)
 {
@@ -720,8 +783,6 @@  static void piix4_acpi_system_hot_add_init(MemoryRegion *parent,
                           PCI_HOTPLUG_SIZE);
     memory_region_add_subregion(parent, PCI_HOTPLUG_ADDR,
                                 &s->io_pci);
-    pci_bus_hotplug(bus, piix4_device_hotplug, &s->dev.qdev);
-
     qemu_for_each_cpu(piix4_init_cpu_status, &s->gpe_cpu);
     memory_region_init_io(&s->io_cpu, &cpu_hotplug_ops, s, "apci-cpu-hotplug",
                           PIIX4_PROC_LEN);
@@ -730,16 +791,16 @@  static void piix4_acpi_system_hot_add_init(MemoryRegion *parent,
     qemu_register_cpu_added_notifier(&s->cpu_added_notifier);
 }
 
-static void enable_device(PIIX4PMState *s, int slot)
+static void enable_device(PIIX4PMState *s, unsigned bsel, int slot)
 {
     s->ar.gpe.sts[0] |= PIIX4_PCI_HOTPLUG_STATUS;
-    s->pci0_slot_device_present |= (1U << slot);
+    s->pci_status[bsel].device_present |= (1U << slot);
 }
 
-static void disable_device(PIIX4PMState *s, int slot)
+static void disable_device(PIIX4PMState *s, unsigned bsel, int slot)
 {
     s->ar.gpe.sts[0] |= PIIX4_PCI_HOTPLUG_STATUS;
-    s->pci0_status.down |= (1U << slot);
+    s->pci_status[bsel].down |= (1U << slot);
 }
 
 static int piix4_device_hotplug(DeviceState *qdev, PCIDevice *dev,
@@ -748,19 +809,23 @@  static int piix4_device_hotplug(DeviceState *qdev, PCIDevice *dev,
     int slot = PCI_SLOT(dev->devfn);
     PIIX4PMState *s = DO_UPCAST(PIIX4PMState, dev,
                                 PCI_DEVICE(qdev));
+    int bsel = piix4_find_hotplug_bus(s->guest_info, dev->bus);
+    if (bsel < 0) {
+        return -1;
+    }
 
     /* Don't send event when device is enabled during qemu machine creation:
      * it is present on boot, no hotplug event is necessary. We do send an
      * event when the device is disabled later. */
     if (state == PCI_COLDPLUG_ENABLED) {
-        s->pci0_slot_device_present |= (1U << slot);
+        s->pci_status[bsel].device_present |= (1U << slot);
         return 0;
     }
 
     if (state == PCI_HOTPLUG_ENABLED) {
-        enable_device(s, slot);
+        enable_device(s, bsel, slot);
     } else {
-        disable_device(s, slot);
+        disable_device(s, bsel, slot);
     }
 
     pm_update_sci(s);
diff --git a/hw/i386/acpi-build.c b/hw/i386/acpi-build.c
index 72606a8..43e4988 100644
--- a/hw/i386/acpi-build.c
+++ b/hw/i386/acpi-build.c
@@ -34,6 +34,7 @@ 
 #include "hw/acpi/acpi.h"
 #include "hw/nvram/fw_cfg.h"
 #include "hw/i386/bios-linker-loader.h"
+#include "hw/pci/pci_bus.h"
 
 #define ACPI_BUILD_APPNAME  "Bochs"
 #define ACPI_BUILD_APPNAME6 "BOCHS "
@@ -258,23 +259,358 @@  acpi_encode_len(uint8_t *ssdt_ptr, int length, int bytes)
 #define ACPI_PROC_SIZEOF (*ssdt_proc_end - *ssdt_proc_start)
 #define ACPI_PROC_AML (ssdp_proc_aml + *ssdt_proc_start)
 
-/* 0x5B 0x82 DeviceOp PkgLength NameString */
-#define ACPI_PCIHP_OFFSET_HEX (*ssdt_pcihp_name - *ssdt_pcihp_start + 1)
-#define ACPI_PCIHP_OFFSET_ID (*ssdt_pcihp_id - *ssdt_pcihp_start)
-#define ACPI_PCIHP_OFFSET_ADR (*ssdt_pcihp_adr - *ssdt_pcihp_start)
-#define ACPI_PCIHP_OFFSET_EJ0 (*ssdt_pcihp_ej0 - *ssdt_pcihp_start)
-#define ACPI_PCIHP_SIZEOF (*ssdt_pcihp_end - *ssdt_pcihp_start)
-#define ACPI_PCIHP_AML (ssdp_pcihp_aml + *ssdt_pcihp_start)
-
 #define ACPI_SSDT_SIGNATURE 0x54445353 /* SSDT */
 #define ACPI_SSDT_HEADER_LENGTH 36
 
 #include "hw/i386/ssdt-misc.hex"
-#include "hw/i386/ssdt-pcihp.hex"
+
+static inline GArray *build_alloc_array(void)
+{
+        return g_array_new(false, true /* clear */, 1);
+}
+
+static inline void build_free_array(GArray *array)
+{
+        g_array_free(array, true);
+}
+
+static inline void build_prepend_byte(GArray *array, uint8_t val)
+{
+    g_array_prepend_val(array, val);
+}
+
+static inline void build_append_byte(GArray *array, uint8_t val)
+{
+    g_array_append_val(array, val);
+}
+
+static inline void build_prepend_array(GArray *array, GArray *val)
+{
+    g_array_prepend_vals(array, val->data, val->len);
+}
+
+static inline void build_append_array(GArray *array, GArray *val)
+{
+    g_array_append_vals(array, val->data, val->len);
+}
+
+static void build_append_nameseg(GArray *array, const char *format, ...)
+{
+    GString *s = g_string_new("");
+    va_list args;
+
+    va_start(args, format);
+    g_string_vprintf(s, format, args);
+    va_end(args);
+
+    assert(s->len == 4);
+    g_array_append_vals(array, s->str, s->len);
+    g_string_free(s, true);
+}
+
+static void build_prepend_nameseg(GArray *array, const char *format, ...)
+{
+    GString *s = g_string_new("");
+    va_list args;
+
+    va_start(args, format);
+    g_string_vprintf(s, format, args);
+    va_end(args);
+
+    assert(s->len == 4);
+    g_array_prepend_vals(array, s->str, s->len);
+    g_string_free(s, true);
+}
+
+static void build_prepend_devfn(GArray *name, uint8_t devfn)
+{
+    build_prepend_nameseg(name, "S%0.02x_", devfn);
+}
+
+static void build_append_devfn(GArray *name, uint8_t devfn)
+{
+    GArray *array = build_alloc_array();
+    build_prepend_devfn(array, devfn);
+    build_append_array(name, array);
+    build_free_array(array);
+}
+
+static void build_prepend_name_string(GArray *name, PCIBus *bus)
+{
+    int segcount;
+
+    for (;!pci_bus_is_root(bus); bus = bus->parent_dev->bus) {
+        build_prepend_devfn(name, bus->parent_dev->devfn);
+    }
+    g_array_prepend_vals(name, "PCI0", 4);
+    g_array_prepend_vals(name, "_SB_", 4);
+    /* Name segment length is a multiple of 4 */
+    assert(name->len % 4 == 0);
+    segcount = name->len / 4;
+    build_prepend_byte(name, segcount);
+    build_prepend_byte(name, 0x2f); /* MultiNamePrefix */
+    build_prepend_byte(name, '\\'); /* RootChar */
+}
+
+static void build_append_name_string(GArray *name, PCIBus *bus)
+{
+    GArray *append = build_alloc_array();
+    build_prepend_name_string(append, bus);
+    build_append_array(name, append);
+    build_free_array(append);
+}
+
+/* 5.4 Definition Block Encoding */
+enum {
+    PACKAGE_LENGTH_1BYTE_SHIFT = 6, /* Up to 63 - use extra 2 bits. */
+    PACKAGE_LENGTH_2BYTE_SHIFT = 4,
+    PACKAGE_LENGTH_3BYTE_SHIFT = 12,
+    PACKAGE_LENGTH_4BYTE_SHIFT = 20,
+};
+
+static void build_prepend_package_length(GArray *package)
+{
+    unsigned bytes;
+    uint8_t byte;
+    unsigned package_len = package->len;
+    unsigned length = package_len;
+
+    if (length + 1 < (1 << PACKAGE_LENGTH_1BYTE_SHIFT)) {
+        byte = length;
+        build_prepend_byte(package, byte);
+        return;
+    }
+
+    if (length + 4 > (1 << PACKAGE_LENGTH_4BYTE_SHIFT)) {
+        byte = length >> PACKAGE_LENGTH_4BYTE_SHIFT;
+        build_prepend_byte(package, byte);
+        length &= (1 << PACKAGE_LENGTH_4BYTE_SHIFT) - 1;
+    }
+    if (length + 3 > (1 << PACKAGE_LENGTH_3BYTE_SHIFT)) {
+        byte = length >> PACKAGE_LENGTH_3BYTE_SHIFT;
+        build_prepend_byte(package, byte);
+        length &= (1 << PACKAGE_LENGTH_3BYTE_SHIFT) - 1;
+    }
+    if (length + 2 > (1 << PACKAGE_LENGTH_2BYTE_SHIFT)) {
+        byte = length >> PACKAGE_LENGTH_2BYTE_SHIFT;
+        build_prepend_byte(package, byte);
+        length &= (1 << PACKAGE_LENGTH_2BYTE_SHIFT) - 1;
+    }
+
+    bytes = package->len - package_len;
+    byte = (bytes << PACKAGE_LENGTH_1BYTE_SHIFT) | length;
+    build_prepend_byte(package, byte);
+}
+
+static void build_package(GArray *package, uint8_t op)
+{
+    build_prepend_package_length(package);
+    build_prepend_byte(package, op);
+}
+
+static void build_extop_package(GArray *package, uint8_t op)
+{
+    build_package(package, op);
+    build_prepend_byte(package, 0x5B); /* ExtOpPrefix */
+}
+
+static void build_append_notify(GArray *method, GArray *target_name,
+                                uint16_t value)
+{
+    GArray *notify = build_alloc_array();
+    uint8_t op = 0xA0; /* IfOp */
+
+    build_append_byte(notify, 0x93); /* LEqualOp */
+    build_append_byte(notify, 0x68); /* Arg0Op */
+    build_append_byte(notify, 0x0A); /* BytePrefix */
+    build_append_byte(notify, value);
+    build_append_byte(notify, value >> 8);
+    build_append_byte(notify, 0x86); /* NotifyOp */
+    build_append_array(notify, target_name);
+    build_append_byte(notify, 0x69); /* Arg1Op */
+
+    /* Pack it up */
+    build_package(notify, op);
+
+    build_append_array(method, notify);
+
+    build_free_array(notify);
+}
+
+static void
+build_append_notify_slots(GArray *method, PCIBus *bus,
+                          unsigned bus_select,
+                          DECLARE_BITMAP(hotplug_enable, PCI_SLOT_MAX))
+{
+    int i;
+
+    if (!bus) {
+        return;
+    }
+
+    for (i = 0; i < 32; i++) {
+        GArray *target;
+        if (!test_bit(i, hotplug_enable)) {
+            continue;
+        }
+        target = build_alloc_array();
+        build_prepend_devfn(target, i);
+        build_prepend_name_string(target, bus);
+        /* *32 matches Multiply in PCNF */
+        build_append_notify(method, target, bus_select * 32 + i);
+        build_free_array(target);
+    }
+}
+
+static void build_append_notify_buses(GArray *table, PcGuestInfo *guest_info)
+{
+    int i;
+    GArray *method = build_alloc_array();
+
+    uint8_t op = 0x14; /* MethodOp */
+
+    build_append_nameseg(method, "PCNT");
+    build_append_byte(method, 0x02); /* MethodFlags: ArgCount */
+    for (i = 0; i < ARRAY_SIZE(guest_info->hotplug_buses); ++i) {
+        build_append_notify_slots(method, guest_info->hotplug_buses[i], i,
+                                  guest_info->hotplug_enable[i]);
+    }
+    build_package(method, op);
+
+    build_append_array(table, method);
+    build_free_array(method);
+}
+
+static void build_append_device(GArray *table, unsigned devfn)
+{
+    uint8_t op, ej0_op;
+
+    GArray *device = build_alloc_array();
+    GArray *ej0 = build_alloc_array();
+
+    op = 0x82; /* DeviceOp */
+
+    build_append_devfn(device, devfn);
+
+    build_append_byte(device, 0x08); /* NameOp */
+    build_append_nameseg(device, "_SUN");
+    build_append_byte(device, 0x0A); /* BytePrefix */
+    build_append_byte(device, PCI_SLOT(devfn));
+    build_append_byte(device, 0x08); /* NameOp */
+    build_append_nameseg(device, "_ADR");
+    build_append_byte(device, 0x0C); /* DWordPrefix */
+    build_append_byte(device, PCI_FUNC(devfn));
+    build_append_byte(device, 0x00);
+    build_append_byte(device, PCI_SLOT(devfn));
+    build_append_byte(device, 0x00);
+
+    ej0_op = 0x14;  /* MethodOp */
+    build_append_nameseg(ej0, "_EJ0");
+    build_append_byte(ej0, 0x01); /* MethodFlags: ArgCount */
+    build_append_byte(ej0, 0xA4); /* ReturnOp */
+    build_append_nameseg(ej0, "PCEJ");
+    build_append_nameseg(ej0, "_SUN");
+    build_append_nameseg(ej0, "BNUM");
+    build_package(ej0, ej0_op);
+    build_append_array(device, ej0);
+
+    build_extop_package(device, op);
+
+    build_append_array(table, device);
+    build_free_array(device);
+    build_free_array(ej0);
+}
+
+static bool build_find_device_bus(PCIBus *buses[GUEST_INFO_MAX_HOTPLUG_BUS],
+                                  PCIBus *bus, unsigned devfn)
+{
+    int i;
+    for (i = 0; i < ARRAY_SIZE(buses); ++i) {
+        if (buses[i] && buses[i]->parent_dev &&
+            buses[i]->parent_dev->bus == bus &&
+            buses[i]->parent_dev->devfn == devfn) {
+            return true;
+        }
+    }
+    return false;
+}
+
+static void
+build_append_bus(GArray *table,  int bnum, PcGuestInfo *guest_info)
+{
+    uint8_t op;
+    int slot;
+    unsigned devfn;
+    GArray *device;
+    PCIBus *bus = guest_info->hotplug_buses[bnum];
+
+    if (!bus) {
+        return;
+    }
+
+    device = build_alloc_array();
+
+    /* Root device described in DSDT */
+    if (bnum) {
+        devfn = bus->parent_dev->devfn;
+
+        op = 0x82; /* DeviceOp */
+
+        build_append_name_string(device, bus);
+
+        build_append_byte(device, 0x08); /* NameOp */
+        build_append_nameseg(device, "_SUN");
+        build_append_byte(device, 0x0A); /* BytePrefix */
+        build_append_byte(device, PCI_SLOT(devfn));
+        build_append_byte(device, 0x08); /* NameOp */
+        build_append_nameseg(device, "BNUM");
+        build_append_byte(device, 0x0A); /* BytePrefix */
+        build_append_byte(device, bnum);
+        build_append_byte(device, 0x08); /* NameOp */
+        build_append_nameseg(device, "_ADR");
+        build_append_byte(device, 0x0C); /* DWordPrefix */
+        build_append_byte(device, PCI_FUNC(devfn));
+        build_append_byte(device, 0x00);
+        build_append_byte(device, PCI_SLOT(devfn));
+        build_append_byte(device, 0x00);
+    } else {
+        op = 0x10; /* ScopeOp */
+        build_append_nameseg(device, "PCI0");
+    }
+
+    for (slot = 0; slot < PCI_SLOT_MAX; ++slot) {
+        if (!test_bit(slot, guest_info->hotplug_enable[bnum])) {
+            continue;
+        }
+        /* Bus device? We'll add it later */
+        if (build_find_device_bus(guest_info->hotplug_buses,
+                                  bus, PCI_DEVFN(slot, 0))) {
+            continue;
+        }
+        build_append_device(device, PCI_DEVFN(slot, 0));
+    }
+
+    if (bnum) {
+        build_extop_package(device, op);
+    } else {
+        build_package(device, op);
+    }
+
+    build_append_array(table, device);
+    build_free_array(device);
+}
+
+static void build_append_device_buses(GArray *table, PcGuestInfo *guest_info)
+{
+    int i;
+    for (i = 0; i < ARRAY_SIZE(guest_info->hotplug_buses); ++i) {
+        build_append_bus(table, i, guest_info);
+    }
+}
 
 static uint8_t*
 build_notify(uint8_t *ssdt_ptr, const char *name, int skip, int count,
-             const char *target, int ofs)
+             const char *target, int ofs, int step)
 {
     int i;
 
@@ -295,31 +631,14 @@  build_notify(uint8_t *ssdt_ptr, const char *name, int skip, int count,
         *(ssdt_ptr++) = i;
         *(ssdt_ptr++) = 0x86; /* NotifyOp */
         memcpy(ssdt_ptr, target, 4);
-        ssdt_ptr[ofs] = acpi_get_hex(i >> 4);
-        ssdt_ptr[ofs + 1] = acpi_get_hex(i);
+        ssdt_ptr[ofs] = acpi_get_hex((i * step) >> 4);
+        ssdt_ptr[ofs + 1] = acpi_get_hex(i * step);
         ssdt_ptr += 4;
         *(ssdt_ptr++) = 0x69; /* Arg1Op */
     }
     return ssdt_ptr;
 }
 
-static void patch_pcihp(int slot, uint8_t *ssdt_ptr, uint32_t eject)
-{
-    ssdt_ptr[ACPI_PCIHP_OFFSET_HEX] = acpi_get_hex(slot >> 4);
-    ssdt_ptr[ACPI_PCIHP_OFFSET_HEX+1] = acpi_get_hex(slot);
-    ssdt_ptr[ACPI_PCIHP_OFFSET_ID] = slot;
-    ssdt_ptr[ACPI_PCIHP_OFFSET_ADR + 2] = slot;
-
-    /* Runtime patching of ACPI_EJ0: to disable hotplug for a slot,
-     * replace the method name: _EJ0 by ACPI_EJ0_. */
-    /* Sanity check */
-    assert (!memcmp(ssdt_ptr + ACPI_PCIHP_OFFSET_EJ0, "_EJ0", 4));
-
-    if (!eject) {
-        memcpy(ssdt_ptr + ACPI_PCIHP_OFFSET_EJ0, "EJ0_", 4);
-    }
-}
-
 static void
 build_ssdt(GArray *table_data, GArray *linker,
            FWCfgState *fw_cfg, PcGuestInfo *guest_info)
@@ -330,11 +649,20 @@  build_ssdt(GArray *table_data, GArray *linker,
                   + (acpi_cpus * ACPI_PROC_SIZEOF)               /* procs */
                   + (1+2+5+(12*acpi_cpus))                  /* NTFY */
                   + (6+2+1+(1*acpi_cpus))                   /* CPON */
-                  + (1+3+4)                                 /* Scope(PCI0) */
-                  + ((PCI_SLOT_MAX - 1) * ACPI_PCIHP_SIZEOF)        /* slots */
-                  + (1+2+5+(12*(PCI_SLOT_MAX - 1))));          /* PCNT */
-    uint8_t *ssdt = acpi_data_push(table_data, length);
-    uint8_t *ssdt_ptr = ssdt;
+                 );
+    uint8_t *ssdt;
+    uint8_t *ssdt_ptr;
+    GArray *devices = build_alloc_array();
+    GArray *notify = build_alloc_array();
+
+    /* TODO: rewrite this function using glib */
+    build_append_device_buses(devices, guest_info);
+    length += devices->len;
+    build_append_notify_buses(notify, guest_info);
+    length += notify->len;
+
+    ssdt = acpi_data_push(table_data, length);
+    ssdt_ptr = ssdt;
 
     /* Copy header and encode fwcfg values in the S3_ / S4_ / S5_ packages */
     memcpy(ssdt_ptr, ssdp_misc_aml, sizeof(ssdp_misc_aml));
@@ -389,7 +717,7 @@  build_ssdt(GArray *table_data, GArray *linker,
 
     /* build "Method(NTFY, 2) {If (LEqual(Arg0, 0x00)) {Notify(CP00, Arg1)} ...}" */
     /* Arg0 = Processor ID = APIC ID */
-    ssdt_ptr = build_notify(ssdt_ptr, "NTFY", 0, acpi_cpus, "CP00", 2);
+    ssdt_ptr = build_notify(ssdt_ptr, "NTFY", 0, acpi_cpus, "CP00", 2, 1);
 
     /* build "Name(CPON, Package() { One, One, ..., Zero, Zero, ... })" */
     *(ssdt_ptr++) = 0x08; /* NameOp */
@@ -411,15 +739,10 @@  build_ssdt(GArray *table_data, GArray *linker,
     *(ssdt_ptr++) = 'I';
     *(ssdt_ptr++) = '0';
 
-    /* build Device object for each slot */
-    for (i = 1; i < PCI_SLOT_MAX; i++) {
-        bool eject = test_bit(i, guest_info->slot_hotplug_enable);
-        memcpy(ssdt_ptr, ACPI_PCIHP_AML, ACPI_PCIHP_SIZEOF);
-        patch_pcihp(i, ssdt_ptr, eject);
-        ssdt_ptr += ACPI_PCIHP_SIZEOF;
-    }
-
-    ssdt_ptr = build_notify(ssdt_ptr, "PCNT", 1, PCI_SLOT_MAX, "S00_", 1);
+    memcpy(ssdt_ptr, devices->data, devices->len);
+    ssdt_ptr += devices->len;
+    memcpy(ssdt_ptr, notify->data, notify->len);
+    ssdt_ptr += notify->len;
 
     build_header(linker, table_data,
                  (void*)ssdt, ACPI_SSDT_SIGNATURE, ssdt_ptr - ssdt, 1);
diff --git a/hw/i386/Makefile.objs b/hw/i386/Makefile.objs
index 2ab2572..c2e74e9 100644
--- a/hw/i386/Makefile.objs
+++ b/hw/i386/Makefile.objs
@@ -6,7 +6,7 @@  obj-$(CONFIG_XEN) += xen_domainbuild.o xen_machine_pv.o
 obj-y += kvmvapic.o
 obj-y += acpi-build.o
 obj-y += bios-linker-loader.o
-hw/i386/acpi-build.o: hw/i386/acpi-build.c hw/i386/acpi-dsdt.hex hw/i386/ssdt-proc.hex hw/i386/ssdt-pcihp.hex hw/i386/ssdt-misc.hex hw/i386/q35-acpi-dsdt.hex
+hw/i386/acpi-build.o: hw/i386/acpi-build.c hw/i386/acpi-dsdt.hex hw/i386/ssdt-proc.hex hw/i386/ssdt-misc.hex hw/i386/q35-acpi-dsdt.hex
 hw/i386/pc_piix.o: hw/i386/pc_piix.c hw/i386/acpi-dsdt.hex
 hw/i386/pc_q35.o: hw/i386/pc_q35.c hw/i386/q35-acpi-dsdt.hex
 
diff --git a/hw/i386/ssdt-pcihp.hex.generated b/hw/i386/ssdt-pcihp.hex.generated
deleted file mode 100644
index 0d32a27..0000000
--- a/hw/i386/ssdt-pcihp.hex.generated
+++ /dev/null
@@ -1,108 +0,0 @@ 
-static unsigned char ssdt_pcihp_name[] = {
-0x33
-};
-static unsigned char ssdt_pcihp_adr[] = {
-0x44
-};
-static unsigned char ssdt_pcihp_end[] = {
-0x58
-};
-static unsigned char ssdp_pcihp_aml[] = {
-0x53,
-0x53,
-0x44,
-0x54,
-0x58,
-0x0,
-0x0,
-0x0,
-0x1,
-0x77,
-0x42,
-0x58,
-0x50,
-0x43,
-0x0,
-0x0,
-0x42,
-0x58,
-0x53,
-0x53,
-0x44,
-0x54,
-0x50,
-0x43,
-0x1,
-0x0,
-0x0,
-0x0,
-0x49,
-0x4e,
-0x54,
-0x4c,
-0x28,
-0x5,
-0x10,
-0x20,
-0x10,
-0x33,
-0x5c,
-0x2e,
-0x5f,
-0x53,
-0x42,
-0x5f,
-0x50,
-0x43,
-0x49,
-0x30,
-0x5b,
-0x82,
-0x26,
-0x53,
-0x41,
-0x41,
-0x5f,
-0x8,
-0x5f,
-0x53,
-0x55,
-0x4e,
-0xa,
-0xaa,
-0x8,
-0x5f,
-0x41,
-0x44,
-0x52,
-0xc,
-0x0,
-0x0,
-0xaa,
-0x0,
-0x14,
-0xf,
-0x5f,
-0x45,
-0x4a,
-0x30,
-0x1,
-0xa4,
-0x50,
-0x43,
-0x45,
-0x4a,
-0x5f,
-0x53,
-0x55,
-0x4e
-};
-static unsigned char ssdt_pcihp_start[] = {
-0x30
-};
-static unsigned char ssdt_pcihp_id[] = {
-0x3d
-};
-static unsigned char ssdt_pcihp_ej0[] = {
-0x4a
-};