
[v2,6/7] x86: acpi: trigger SMI before sending hotplug Notify event to OSPM

Message ID 20200818122208.1243901-7-imammedo@redhat.com
State New
Series x86: fix cpu hotplug with secure boot

Commit Message

Igor Mammedov Aug. 18, 2020, 12:22 p.m. UTC
In case firmware has negotiated the CPU hotplug SMI feature, generate
AML to describe the SMI IO port region and send an SMI to firmware
on each CPU hotplug SCI, in case new CPUs were hotplugged.

Since new CPUs can be hotplugged while CPU_SCAN_METHOD is running,
we can't send the SMI before new CPUs are fetched from QEMU, as that
could cause sending a Notify to a CPU that firmware hasn't seen yet.
So fetch new CPUs into a local cache first, then send the SMI, and
after that send Notify events to the cached CPUs. This should ensure
that Notify is sent only to CPUs which were processed by firmware
first.
Any CPUs hotplugged after caching will be processed by the next
CPU_SCAN_METHOD, when the pending SCI is handled.

Signed-off-by: Igor Mammedov <imammedo@redhat.com>
---
v2:
 - clear insert event after firmware has returned
   control from SMI. (Laszlo Ersek <lersek@redhat.com>)
v1:
 - make sure that Notify is sent only to CPUs seen by FW
 - (Laszlo Ersek <lersek@redhat.com>)
    - use macro instead of x-smi-negotiated-features
    - style fixes
    - reserve whole SMI block (0xB2, 0xB3)
v0:
 - s/aml_string/aml_eisaid/
   /fixes Windows BSOD, on nonsense value/ (Laszlo Ersek <lersek@redhat.com>)
 - put SMI device under PCI0 like the rest of hotplug IO port
 - do not generate SMI AML if CPU hotplug SMI feature hasn't been negotiated

Signed-off-by: Igor Mammedov <imammedo@redhat.com>
---
 include/hw/acpi/cpu.h  |  1 +
 include/hw/i386/ich9.h |  2 ++
 hw/acpi/cpu.c          | 80 +++++++++++++++++++++++++++++++++++++++---
 hw/i386/acpi-build.c   | 35 +++++++++++++++++-
 hw/isa/lpc_ich9.c      |  3 ++
 5 files changed, 115 insertions(+), 6 deletions(-)

Comments

Laszlo Ersek Aug. 25, 2020, 5:25 p.m. UTC | #1
On 08/18/20 14:22, Igor Mammedov wrote:
> In case firmware has negotiated CPU hotplug SMI feature, generate
> AML to describe SMI IO port region and send SMI to firmware
> on each CPU hotplug SCI in case new CPUs were hotplugged.
>
> Since new CPUs can be hotplugged while CPU_SCAN_METHOD is running
> we can't send SMI before new CPUs are fetched from QEMU as it
> could cause sending Notify to a CPU that firmware hasn't seen yet.
> So fetch new CPUs into local cache first and then send SMI and
> after that sends Notify events to cached CPUs. This should ensure
> that Notify is sent only to CPUs which were processed by firmware
> first.
> Any CPUs that were hotplugged after caching will be processed
> by the next CPU_SCAN_METHOD, when pending SCI is handled.
>
> Signed-off-by: Igor Mammedov <imammedo@redhat.com>
> ---
> v2:
>  - clear insert event after firmware has returned
>    control from SMI. (Laszlo Ersek <lersek@redhat.com>)
> v1:
>  - make sure that Notify is sent only to CPUs seen by FW
>  - (Laszlo Ersek <lersek@redhat.com>)
>     - use macro instead of x-smi-negotiated-features
>     - style fixes
>     - reserve whole SMI block (0xB2, 0xB3)
> v0:
>  - s/aml_string/aml_eisaid/
>    /fixes Windows BSOD, on nonsense value/ (Laszlo Ersek <lersek@redhat.com>)
>  - put SMI device under PCI0 like the rest of hotplug IO port
>  - do not generate SMI AML if CPU hotplug SMI feature hasn't been negotiated
>
> Signed-off-by: Igor Mammedov <imammedo@redhat.com>
> ---
>  include/hw/acpi/cpu.h  |  1 +
>  include/hw/i386/ich9.h |  2 ++
>  hw/acpi/cpu.c          | 80 +++++++++++++++++++++++++++++++++++++++---
>  hw/i386/acpi-build.c   | 35 +++++++++++++++++-
>  hw/isa/lpc_ich9.c      |  3 ++
>  5 files changed, 115 insertions(+), 6 deletions(-)

To ease review, this patch could be split into several patches, I
think. I'm not saying that's a "requirement", just that it would be
possible, and could help review.


* Patch "a":

[subject:]

  x86: ich9: expose "smi_negotiated_features" as a QOM property

[commit message:]

  Expose the "smi_negotiated_features" field of ICH9LPCState as a QOM
  property.

[code extracted from current patch follows]

> diff --git a/include/hw/i386/ich9.h b/include/hw/i386/ich9.h
> index d1bb3f7bf0..0f43ef2481 100644
> --- a/include/hw/i386/ich9.h
> +++ b/include/hw/i386/ich9.h
> @@ -245,6 +245,8 @@ typedef struct ICH9LPCState {
>  #define ICH9_SMB_HST_D1                         0x06
>  #define ICH9_SMB_HOST_BLOCK_DB                  0x07
>
> +#define ICH9_LPC_SMI_NEGOTIATED_FEAT_PROP "x-smi-negotiated-features"
> +
>  /* bit positions used in fw_cfg SMI feature negotiation */
>  #define ICH9_LPC_SMI_F_BROADCAST_BIT            0
>  #define ICH9_LPC_SMI_F_CPU_HOTPLUG_BIT          1
> diff --git a/hw/isa/lpc_ich9.c b/hw/isa/lpc_ich9.c
> index 19f32bed3e..8124d20338 100644
> --- a/hw/isa/lpc_ich9.c
> +++ b/hw/isa/lpc_ich9.c
> @@ -647,6 +647,9 @@ static void ich9_lpc_initfn(Object *obj)
>                                    &acpi_enable_cmd, OBJ_PROP_FLAG_READ);
>      object_property_add_uint8_ptr(OBJECT(lpc), ACPI_PM_PROP_ACPI_DISABLE_CMD,
>                                    &acpi_disable_cmd, OBJ_PROP_FLAG_READ);
> +    object_property_add_uint64_ptr(obj, ICH9_LPC_SMI_NEGOTIATED_FEAT_PROP,
> +                                   &lpc->smi_negotiated_features,
> +                                   OBJ_PROP_FLAG_READ);
>
>      ich9_pm_add_properties(obj, &lpc->pm);
>  }
>

For sub-patch "a":
Reviewed-by: Laszlo Ersek <lersek@redhat.com>


* Patch "b":

[subject:]

  x86: acpi: introduce "AcpiPmInfo.smi_on_cpuhp"

[commit message:]

  Translate the "CPU hotplug with SMI" feature bit, from the property
  added in the last patch, to a dedicated boolean in AcpiPmInfo.

[code extracted from current patch follows]

> diff --git a/hw/i386/acpi-build.c b/hw/i386/acpi-build.c
> index b7bcbbbb2a..2291c050ba 100644
> --- a/hw/i386/acpi-build.c
> +++ b/hw/i386/acpi-build.c
> @@ -95,6 +95,7 @@ typedef struct AcpiPmInfo {
>      bool s3_disabled;
>      bool s4_disabled;
>      bool pcihp_bridge_en;
> +    bool smi_on_cpuhp;
>      uint8_t s4_val;
>      AcpiFadtData fadt;
>      uint16_t cpu_hp_io_base;
> @@ -194,6 +195,7 @@ static void acpi_get_pm_info(MachineState *machine, AcpiPmInfo *pm)
>      pm->cpu_hp_io_base = 0;
>      pm->pcihp_io_base = 0;
>      pm->pcihp_io_len = 0;
> +    pm->smi_on_cpuhp = false;
>
>      assert(obj);
>      init_common_fadt_data(machine, obj, &pm->fadt);
> @@ -207,12 +209,16 @@ static void acpi_get_pm_info(MachineState *machine, AcpiPmInfo *pm)
>              object_property_get_uint(obj, ACPI_PCIHP_IO_LEN_PROP, NULL);
>      }
>      if (lpc) {
> +        uint64_t smi_features = object_property_get_uint(lpc,
> +            ICH9_LPC_SMI_NEGOTIATED_FEAT_PROP, NULL);
>          struct AcpiGenericAddress r = { .space_id = AML_AS_SYSTEM_IO,
>              .bit_width = 8, .address = ICH9_RST_CNT_IOPORT };
>          pm->fadt.reset_reg = r;
>          pm->fadt.reset_val = 0xf;
>          pm->fadt.flags |= 1 << ACPI_FADT_F_RESET_REG_SUP;
>          pm->cpu_hp_io_base = ICH9_CPU_HOTPLUG_IO_BASE;
> +        pm->smi_on_cpuhp =
> +            !!(smi_features & BIT_ULL(ICH9_LPC_SMI_F_CPU_HOTPLUG_BIT));
>      }
>
>      /* The above need not be conditional on machine type because the reset port

For sub-patch "b":
Reviewed-by: Laszlo Ersek <lersek@redhat.com>


* Patch "c":

[subject:]

  x86: acpi: introduce the "PCI0.SMI0" ACPI device

[commit message:]

  When CPU hotplug with SMI has been negotiated, describe the SMI
  register block in the DSDT. Pass the ACPI name of the SMI control
  register to build_cpus_aml(), so that CPU_SCAN_METHOD can access the
  register in the next patch.

[code extracted from current patch follows]

> diff --git a/include/hw/acpi/cpu.h b/include/hw/acpi/cpu.h
> index 62f0278ba2..0eeedaa491 100644
> --- a/include/hw/acpi/cpu.h
> +++ b/include/hw/acpi/cpu.h
> @@ -50,6 +50,7 @@ void cpu_hotplug_hw_init(MemoryRegion *as, Object *owner,
>  typedef struct CPUHotplugFeatures {
>      bool acpi_1_compatible;
>      bool has_legacy_cphp;
> +    const char *smi_path;
>  } CPUHotplugFeatures;
>
>  void build_cpus_aml(Aml *table, MachineState *machine, CPUHotplugFeatures opts,
> diff --git a/hw/i386/acpi-build.c b/hw/i386/acpi-build.c
> index b7bcbbbb2a..2291c050ba 100644
> --- a/hw/i386/acpi-build.c
> +++ b/hw/i386/acpi-build.c
> @@ -1515,6 +1521,32 @@ build_dsdt(GArray *table_data, BIOSLinker *linker,
>          aml_append(dev, aml_name_decl("_UID", aml_int(1)));
>          aml_append(dev, build_q35_osc_method());
>          aml_append(sb_scope, dev);
> +
> +        if (pm->smi_on_cpuhp) {
> +            /* reserve SMI block resources, IO ports 0xB2, 0xB3 */
> +            dev = aml_device("PCI0.SMI0");
> +            aml_append(dev, aml_name_decl("_HID", aml_eisaid("PNP0A06")));
> +            aml_append(dev, aml_name_decl("_UID", aml_string("SMI resources")));
> +            crs = aml_resource_template();
> +            aml_append(crs,
> +                aml_io(
> +                       AML_DECODE16,
> +                       ACPI_PORT_SMI_CMD,
> +                       ACPI_PORT_SMI_CMD,
> +                       1,
> +                       2)
> +            );
> +            aml_append(dev, aml_name_decl("_CRS", crs));
> +            aml_append(dev, aml_operation_region("SMIR", AML_SYSTEM_IO,
> +                aml_int(ACPI_PORT_SMI_CMD), 2));
> +            field = aml_field("SMIR", AML_BYTE_ACC, AML_NOLOCK,
> +                              AML_WRITE_AS_ZEROS);
> +            aml_append(field, aml_named_field("SMIC", 8));
> +            aml_append(field, aml_reserved_field(8));
> +            aml_append(dev, field);
> +            aml_append(sb_scope, dev);
> +        }
> +
>          aml_append(dsdt, sb_scope);
>
>          build_hpet_aml(dsdt);
> @@ -1530,7 +1562,8 @@ build_dsdt(GArray *table_data, BIOSLinker *linker,
>          build_legacy_cpu_hotplug_aml(dsdt, machine, pm->cpu_hp_io_base);
>      } else {
>          CPUHotplugFeatures opts = {
> -            .acpi_1_compatible = true, .has_legacy_cphp = true
> +            .acpi_1_compatible = true, .has_legacy_cphp = true,
> +            .smi_path = pm->smi_on_cpuhp ? "\\_SB.PCI0.SMI0.SMIC" : NULL,
>          };
>          build_cpus_aml(dsdt, machine, opts, pm->cpu_hp_io_base,
>                         "\\_SB.PCI0", "\\_GPE._E02");

For sub-patch "c":
Reviewed-by: Laszlo Ersek <lersek@redhat.com>


* Patch "d":

[subject: copy from current (large) patch]

[commit message: copy from current (large) patch]

[code extracted from current patch follows, with my comments]

> diff --git a/hw/acpi/cpu.c b/hw/acpi/cpu.c
> index 3d6a500fb7..4864c3b396 100644
> --- a/hw/acpi/cpu.c
> +++ b/hw/acpi/cpu.c
> @@ -14,6 +14,8 @@
>  #define ACPI_CPU_CMD_DATA_OFFSET_RW 8
>  #define ACPI_CPU_CMD_DATA2_OFFSET_R 0
>
> +#define OVMF_CPUHP_SMI_CMD 4
> +
>  enum {
>      CPHP_GET_NEXT_CPU_WITH_EVENT_CMD = 0,
>      CPHP_OST_EVENT_CMD = 1,
> @@ -321,6 +323,7 @@ const VMStateDescription vmstate_cpu_hotplug = {
>  #define CPU_NOTIFY_METHOD "CTFY"
>  #define CPU_EJECT_METHOD  "CEJ0"
>  #define CPU_OST_METHOD    "COST"
> +#define CPU_ADDED_LIST    "CNEW"
>
>  #define CPU_ENABLED       "CPEN"
>  #define CPU_SELECTOR      "CSEL"
> @@ -471,21 +474,61 @@ void build_cpus_aml(Aml *table, MachineState *machine, CPUHotplugFeatures opts,
>              Aml *dev_chk = aml_int(1);
>              Aml *eject_req = aml_int(3);
>              Aml *next_cpu_cmd = aml_int(CPHP_GET_NEXT_CPU_WITH_EVENT_CMD);
> +            Aml *num_added_cpus = aml_local(1);
> +            Aml *cpu_idx = aml_local(2);
> +            Aml *uid = aml_local(3);
> +            Aml *new_cpus = aml_name(CPU_ADDED_LIST);
>
>              aml_append(method, aml_acquire(ctrl_lock, 0xFFFF));
> +
> +            /* use a named package, as old Windows doesn't support it in a local var */
> +            if (arch_ids->len <= 256) {
> +                /*
> +                 * For compatibility with Windows Server 2008 (max 256 cpus),
> +                 * use ACPI1.0 PackageOp which can cache max 255 elements.
> +                 * Which is fine for a 256 CPU VM. Max 255 can be hotplugged,
> +                 * since at least one CPU must already be present.
> +                 */
> +                aml_append(method, aml_name_decl(CPU_ADDED_LIST,
> +                    aml_package(arch_ids->len)));
> +            } else {
> +                aml_append(method, aml_name_decl(CPU_ADDED_LIST,
> +                    aml_varpackage(arch_ids->len)));
> +            }

(1) Both package sizes are off-by-one. We only need room for the CPUs
that have events pending, and that excludes CPU#0.

In particular, in the compatibility case, when "arch_ids->len" is
exactly 256, then we pass 256 to aml_package(). However, aml_package()
-- correctly! -- takes:

  uint8_t num_elements

So with 256 VCPUs, we'd pass a zero "num_elements".

The comparison (<= 256) is correct, we just need to pass

  arch_ids->len - 1

to both aml_package() and aml_varpackage().


> +
> +            aml_append(method, aml_store(zero, num_added_cpus));
>              aml_append(method, aml_store(one, has_event));
> -            while_ctx = aml_while(aml_equal(has_event, one));
> +            /* start scan from the 1st CPU */
> +            aml_append(method, aml_store(zero, uid));
> +            while_ctx = aml_while(aml_land(aml_equal(has_event, one),
> +                                  aml_lless(uid, aml_int(arch_ids->len))));
>              {
>                   /* clear loop exit condition, ins_evt/rm_evt checks
>                    * will set it to 1 while next_cpu_cmd returns a CPU
>                    * with events */
>                   aml_append(while_ctx, aml_store(zero, has_event));
> +
> +                 aml_append(while_ctx, aml_store(uid, cpu_selector));
>                   aml_append(while_ctx, aml_store(next_cpu_cmd, cpu_cmd));
> +
> +                 /*
> +                  * wrap around case, scan is complete, exit loop.
> +                  * It happens since events are not cleared in scan loop,
> +                  * so next_cpu_cmd continues to find already processed CPUs.
> +                  */
> +                 ifctx = aml_if(aml_lless(cpu_data, uid));
> +                 {
> +                     aml_append(ifctx, aml_break());
> +                 }
> +                 aml_append(while_ctx, ifctx);
> +
> +                 aml_append(while_ctx, aml_store(cpu_data, uid));
>                   ifctx = aml_if(aml_equal(ins_evt, one));
>                   {
> -                     aml_append(ifctx,
> -                         aml_call2(CPU_NOTIFY_METHOD, cpu_data, dev_chk));
> -                     aml_append(ifctx, aml_store(one, ins_evt));
> +                     /* cache added CPUs to Notify/Wakeup later */
> +                     aml_append(ifctx, aml_store(uid,
> +                         aml_index(new_cpus, num_added_cpus)));
> +                     aml_append(ifctx, aml_increment(num_added_cpus));
>                       aml_append(ifctx, aml_store(one, has_event));
>                   }
>                   aml_append(while_ctx, ifctx);
> @@ -493,14 +536,41 @@ void build_cpus_aml(Aml *table, MachineState *machine, CPUHotplugFeatures opts,
>                   ifctx = aml_if(aml_equal(rm_evt, one));
>                   {
>                       aml_append(ifctx,
> -                         aml_call2(CPU_NOTIFY_METHOD, cpu_data, eject_req));
> +                         aml_call2(CPU_NOTIFY_METHOD, uid, eject_req));
>                       aml_append(ifctx, aml_store(one, rm_evt));
>                       aml_append(ifctx, aml_store(one, has_event));
>                   }
>                   aml_append(else_ctx, ifctx);
>                   aml_append(while_ctx, else_ctx);
> +                 aml_append(while_ctx, aml_increment(uid));
> +            }
> +            aml_append(method, while_ctx);
> +
> +            /*
> +             * in case FW negotiated ICH9_LPC_SMI_F_CPU_HOTPLUG_BIT,
> +             * make upcall to FW, so it can pull in new CPUs before
> +             * OS is notified and wakes them up
> +             */
> +            if (opts.smi_path) {
> +                ifctx = aml_if(aml_lnot(aml_equal(num_added_cpus, zero)));

(2) The composed "aml_lnot(aml_equal())" is not wrong, but I think
aml_lgreater() would be simpler (for the QEMU source code anyway).

Feel free to ignore.

> +                {
> +                    aml_append(ifctx, aml_store(aml_int(OVMF_CPUHP_SMI_CMD),
> +                        aml_name("%s", opts.smi_path)));
> +                }
> +                aml_append(method, ifctx);
> +            }
> +
> +            /* Notify OSPM about new CPUs and clear insert events */
> +            aml_append(method, aml_store(zero, cpu_idx));
> +            while_ctx = aml_while(aml_lless(cpu_idx, num_added_cpus));
> +            {
> +                aml_append(while_ctx, aml_call2(CPU_NOTIFY_METHOD,
> +                    aml_derefof(aml_index(new_cpus, cpu_idx)), dev_chk));
> +                aml_append(while_ctx, aml_store(one, ins_evt));

(3) Before writing 1 to CINS, you need to set CSEL as well.

This bug is masked if the loop finds *at most* one added CPU. To see
why, here's the decompiled method, with a possible CPU count of 4, and
the local variable names substituted:

   1  Method (CSCN, 0, Serialized)
   2  {
   3      Acquire (\_SB.PCI0.PRES.CPLK, 0xFFFF)
   4      Name (CNEW, Package (0x04){})
   5
   6      Local_NumAddedCpus = Zero
   7      Local_HasEvent = One
   8      Local_Uid = Zero
   9
  10      While (((Local_HasEvent == One) && (Local_Uid < 0x04)))
  11      {
  12          Local_HasEvent = Zero
  13          \_SB.PCI0.PRES.CSEL = Local_Uid
  14          \_SB.PCI0.PRES.CCMD = Zero
  15          If ((\_SB.PCI0.PRES.CDAT < Local_Uid))
  16          {
  17              Break
  18          }
  19
  20          Local_Uid = \_SB.PCI0.PRES.CDAT
  21          If ((\_SB.PCI0.PRES.CINS == One))
  22          {
  23              CNEW [Local_NumAddedCpus] = Local_Uid
  24              Local_NumAddedCpus++
  25              Local_HasEvent = One
  26          }
  27          ElseIf ((\_SB.PCI0.PRES.CRMV == One))
  28          {
  29              CTFY (Local_Uid, 0x03)
  30              \_SB.PCI0.PRES.CRMV = One
  31              Local_HasEvent = One
  32          }
  33
  34          Local_Uid++
  35      }
  36
  37      If ((Local_NumAddedCpus != Zero))
  38      {
  39          \_SB.PCI0.SMI0.SMIC = 0x04
  40      }
  41
  42      Local_CpuIdx = Zero
  43      While ((Local_CpuIdx < Local_NumAddedCpus))
  44      {
  45          CTFY (DerefOf (CNEW [Local_CpuIdx]), One)
  46          \_SB.PCI0.PRES.CINS = One
  47          Local_CpuIdx++
  48      }
  49
  50      Release (\_SB.PCI0.PRES.CPLK)
  51  }

If the loop has found exactly one VCPU with a pending event, then we
must have exited the loop in one of two ways:

- The UID of the VCPU with the pending event was maximal, and so the
  Local_Uid limit in the While fired on the next iteration (line 10).

- Otherwise, the search wrapped on the next iteration (line 17).

In either case, the loop leaves the "right" CPU selector in place:

- In the first case, when we exit on line 10, we never write to CSEL
  again (line 13).

- In the second case, we wrap back to the *one and only* VCPU with
  pending events, on line 14, before we exit on line 17.

And so when we reach the *one and only* write to CINS on line 46, the
CPU selector happens to be correct.

Because of this, I happened to test this patch successfully with e.g.

  virsh setvcpu DOMAIN 3 --enable --live

(note: "setvcpu", singular).

However, the bug can be triggered when plugging multiple VCPUs in quick
succession, with e.g.

  virsh setvcpus DOMAIN 8 --live

(note "setvcpus", plural).

So I would suggest fetching the CNEW array element back into "uid"
first, then using "uid" for both the NOTIFY call, and the (currently
missing) restoration of CSEL. Then we can write 1 to CINS.

Expressed as a patch on top of yours:

> diff --git a/hw/acpi/cpu.c b/hw/acpi/cpu.c
> index 4864c3b39694..2bea6144fd5e 100644
> --- a/hw/acpi/cpu.c
> +++ b/hw/acpi/cpu.c
> @@ -564,8 +564,11 @@ void build_cpus_aml(Aml *table, MachineState *machine, CPUHotplugFeatures opts,
>              aml_append(method, aml_store(zero, cpu_idx));
>              while_ctx = aml_while(aml_lless(cpu_idx, num_added_cpus));
>              {
> -                aml_append(while_ctx, aml_call2(CPU_NOTIFY_METHOD,
> -                    aml_derefof(aml_index(new_cpus, cpu_idx)), dev_chk));
> +                aml_append(while_ctx,
> +                    aml_store(aml_derefof(aml_index(new_cpus, cpu_idx)), uid));
> +                aml_append(while_ctx,
> +                    aml_call2(CPU_NOTIFY_METHOD, uid, dev_chk));
> +                aml_append(while_ctx, aml_store(uid, cpu_selector));
>                  aml_append(while_ctx, aml_store(one, ins_evt));
>                  aml_append(while_ctx, aml_increment(cpu_idx));
>              }

This effects the following change, in the decompiled method:

> @@ -37,15 +37,17 @@
>      If ((Local_NumAddedCpus != Zero))
>      {
>          \_SB.PCI0.SMI0.SMIC = 0x04
>      }
>
>      Local_CpuIdx = Zero
>      While ((Local_CpuIdx < Local_NumAddedCpus))
>      {
> -        CTFY (DerefOf (CNEW [Local_CpuIdx]), One)
> +        Local_Uid = DerefOf (CNEW [Local_CpuIdx])
> +        CTFY (Local_Uid, One)
> +        \_SB.PCI0.PRES.CSEL = Local_Uid
>          \_SB.PCI0.PRES.CINS = One
>          Local_CpuIdx++
>      }
>
>      Release (\_SB.PCI0.PRES.CPLK)
>  }

With this change, the

  virsh setvcpus DOMAIN 8 --live

command works for me. The topology in my test domain has CPU#0 and CPU#2
cold-plugged, so the command adds 6 VCPUs. Viewed from the firmware
side, the 6 "device_add" commands, issued in close succession by
libvirtd, coalesce into 4 "batches". (And of course the firmware sees
the 4 batches back-to-back.)

Note the "PluggedCount" message, which reports the size of each batch:

- Batch#1 adds CPU#1:

> QemuCpuhpCollectApicIds: CurrentSelector=1: insert
> QemuCpuhpCollectApicIds: ApicId=0x00000001
> QemuCpuhpCollectApicIds: CurrentSelector=2 PendingSelector=1: wrap-around
> QemuCpuhpCollectApicIds: PluggedCount=1 ToUnplugCount=0
> CpuHotplugMmi: hot-added APIC ID 0x00000001, SMBASE 0x7FF7A000, EFI_SMM_CPU_SERVICE_PROTOCOL assigned number 2

- Batch#2 adds CPU#3:

> QemuCpuhpCollectApicIds: CurrentSelector=3: insert
> QemuCpuhpCollectApicIds: ApicId=0x00000003
> QemuCpuhpCollectApicIds: CurrentSelector=4 PendingSelector=3: wrap-around
> QemuCpuhpCollectApicIds: PluggedCount=1 ToUnplugCount=0
> CpuHotplugMmi: hot-added APIC ID 0x00000003, SMBASE 0x7FF7C000, EFI_SMM_CPU_SERVICE_PROTOCOL assigned number 3

- Batch#3 adds CPU#4, CPU#5, CPU#6:

> QemuCpuhpCollectApicIds: CurrentSelector=4: insert
> QemuCpuhpCollectApicIds: ApicId=0x00000004
> QemuCpuhpCollectApicIds: CurrentSelector=5: insert
> QemuCpuhpCollectApicIds: ApicId=0x00000005
> QemuCpuhpCollectApicIds: CurrentSelector=6: insert
> QemuCpuhpCollectApicIds: ApicId=0x00000006
> QemuCpuhpCollectApicIds: CurrentSelector=7 PendingSelector=4: wrap-around
> QemuCpuhpCollectApicIds: PluggedCount=3 ToUnplugCount=0
> CpuHotplugMmi: hot-added APIC ID 0x00000004, SMBASE 0x7FF7E000, EFI_SMM_CPU_SERVICE_PROTOCOL assigned number 4
> CpuHotplugMmi: hot-added APIC ID 0x00000005, SMBASE 0x7FF80000, EFI_SMM_CPU_SERVICE_PROTOCOL assigned number 5
> CpuHotplugMmi: hot-added APIC ID 0x00000006, SMBASE 0x7FF82000, EFI_SMM_CPU_SERVICE_PROTOCOL assigned number 6

- Batch#4 adds CPU#7:

> QemuCpuhpCollectApicIds: CurrentSelector=7: insert
> QemuCpuhpCollectApicIds: ApicId=0x00000007
> QemuCpuhpCollectApicIds: PluggedCount=1 ToUnplugCount=0
> CpuHotplugMmi: hot-added APIC ID 0x00000007, SMBASE 0x7FF84000, EFI_SMM_CPU_SERVICE_PROTOCOL assigned number 7

Thanks,
Laszlo

On 08/18/20 14:22, Igor Mammedov wrote:
> +                aml_append(while_ctx, aml_increment(cpu_idx));
>              }
>              aml_append(method, while_ctx);
> +
>              aml_append(method, aml_release(ctrl_lock));
>          }
>          aml_append(cpus_dev, method);
Igor Mammedov Aug. 26, 2020, 9:23 a.m. UTC | #2
On Tue, 25 Aug 2020 19:25:48 +0200
Laszlo Ersek <lersek@redhat.com> wrote:

> On 08/18/20 14:22, Igor Mammedov wrote:
> > In case firmware has negotiated CPU hotplug SMI feature, generate
> > AML to describe SMI IO port region and send SMI to firmware
> > on each CPU hotplug SCI in case new CPUs were hotplugged.
> >
> > Since new CPUs can be hotplugged while CPU_SCAN_METHOD is running
> > we can't send SMI before new CPUs are fetched from QEMU as it
> > could cause sending Notify to a CPU that firmware hasn't seen yet.
> > So fetch new CPUs into local cache first and then send SMI and
> > after that sends Notify events to cached CPUs. This should ensure
> > that Notify is sent only to CPUs which were processed by firmware
> > first.
> > Any CPUs that were hotplugged after caching will be processed
> > by the next CPU_SCAN_METHOD, when pending SCI is handled.
> >
> > Signed-off-by: Igor Mammedov <imammedo@redhat.com>
> > ---
> > v2:
> >  - clear insert event after firmware has returned
> >    control from SMI. (Laszlo Ersek <lersek@redhat.com>)

Laszlo,
Thanks a lot for such a detailed review.
I'll respin the series with your feedback incorporated.

[...]
Laszlo Ersek Aug. 26, 2020, 9:24 a.m. UTC | #3
Hi Igor,

On 08/25/20 19:25, Laszlo Ersek wrote:

> So I would suggest fetching the CNEW array element back into "uid"
> first, then using "uid" for both the NOTIFY call, and the (currently
> missing) restoration of CSEL. Then we can write 1 to CINS.
>
> Expressed as a patch on top of yours:
>
>> diff --git a/hw/acpi/cpu.c b/hw/acpi/cpu.c
>> index 4864c3b39694..2bea6144fd5e 100644
>> --- a/hw/acpi/cpu.c
>> +++ b/hw/acpi/cpu.c
>> @@ -564,8 +564,11 @@ void build_cpus_aml(Aml *table, MachineState *machine, CPUHotplugFeatures opts,
>>              aml_append(method, aml_store(zero, cpu_idx));
>>              while_ctx = aml_while(aml_lless(cpu_idx, num_added_cpus));
>>              {
>> -                aml_append(while_ctx, aml_call2(CPU_NOTIFY_METHOD,
>> -                    aml_derefof(aml_index(new_cpus, cpu_idx)), dev_chk));
>> +                aml_append(while_ctx,
>> +                    aml_store(aml_derefof(aml_index(new_cpus, cpu_idx)), uid));
>> +                aml_append(while_ctx,
>> +                    aml_call2(CPU_NOTIFY_METHOD, uid, dev_chk));
>> +                aml_append(while_ctx, aml_store(uid, cpu_selector));
>>                  aml_append(while_ctx, aml_store(one, ins_evt));
>>                  aml_append(while_ctx, aml_increment(cpu_idx));
>>              }
>
> This effects the following change, in the decompiled method:
>
>> @@ -37,15 +37,17 @@
>>      If ((Local_NumAddedCpus != Zero))
>>      {
>>          \_SB.PCI0.SMI0.SMIC = 0x04
>>      }
>>
>>      Local_CpuIdx = Zero
>>      While ((Local_CpuIdx < Local_NumAddedCpus))
>>      {
>> -        CTFY (DerefOf (CNEW [Local_CpuIdx]), One)
>> +        Local_Uid = DerefOf (CNEW [Local_CpuIdx])
>> +        CTFY (Local_Uid, One)
>> +        \_SB.PCI0.PRES.CSEL = Local_Uid
>>          \_SB.PCI0.PRES.CINS = One
>>          Local_CpuIdx++
>>      }
>>
>>      Release (\_SB.PCI0.PRES.CPLK)
>>  }
>
> With this change, the
>
>   virsh setvcpus DOMAIN 8 --live
>
> command works for me. The topology in my test domain has CPU#0 and
> CPU#2 cold-plugged, so the command adds 6 VCPUs. Viewed from the
> firmware side, the 6 "device_add" commands, issued in close succession
> by libvirtd, coalesce into 4 "batches". (And of course the firmware
> sees the 4 batches back-to-back.)

unfortunately, with more testing, I have run into two more races:

(1) When a "device_add" occurs after the ACPI loop collects the CPUS
    from the register block, but before the SMI.

    Here, the "stray CPU" is processed fine by the firmware. However,
    the CTFY loop in ACPI does not know about the CPU, so it doesn't
    clear the pending insert event for it. And when the firmware is
    entered with an SMI for the *next* time, the firmware sees the same
    CPU *again* as pending, and tries to relocate it again. Bad things
    happen.

(2) When a "device_add" occurs after the SMI, but before the firmware
    collects the pending CPUs from the register block.

    Here, the firmware collects the "stray CPU". However, the "broadcast
    SMI", with which we entered the firmware, did *not* cover the stray
    CPU -- the CPU_FOREACH() loop in ich9_apm_ctrl_changed() could not
    make the SMI pending for the new CPU, because at that time, the CPU
    had not been added yet. As a result, when the firmware sends an
    INIT-SIPI-SIPI to the new CPU, expecting it to boot right into SMM,
    the new CPU instead boots straight into the post-RSM (normal mode)
    "pen", skipping its initial SMI handler. Meaning that the CPU halts
    nicely, but its SMBASE is never relocated, and the SMRAM message
    exchange with the BSP falls apart.

Possible mitigations I can think of:

For problem (1):

  (1a) Change the firmware so it notices that it has relocated the
       "stray" CPU before -- such CPUs should be simply skipped in the
       firmware. The next time the CTFY loop runs in ACPI, it will clear
       the pending event.

  (1b) Alternatively, stop consuming the hotplug register block in the
       firmware altogether, and work out general message passing, from
       ACPI to firmware. See the outline here:

         http://mid.mail-archive.com/cf887d74-f65d-602a-9629-3d25cef93a69@redhat.com

For problem (2):

  (2a) Change the firmware so that it sends a directed SMI as well to
       each CPU, just before sending an INIT-SIPI-SIPI. This should be
       idempotent -- if the broadcast SMI *has* covered the CPU,
       then sending a directed SMI should make no difference.

  (2b) Alternatively, change the "device_add" command in QEMU so that,
       if "CPU hotplug with SMI" has been negotiated, the new CPU is
       added with the SMI made pending for it at once. (That is, no
       hot-plugged CPU would exist with the directed SMI *not* pending
       for it.)

  (2c) Alternatively, approach (1b) would fix problem (2) as well -- the
       firmware would only relocate such CPUs that ACPI collected before
       injecting the SMI. So all those CPUs would have the SMI pending.


I can experiment with (1a) and (2a), but note that for unplug, we're
likely going to need the array passing method (1b) anyway. I'm concerned
that (1a) and (2a) together may not fix all possible races (i.e. we
might see more kinds).

What's important is that, *IF* we need to change anything in the
firmware for fixing these races, then we either need to do so urgently,
*or* I have to revert OVMF commit 5ba203b54e59 ("OvmfPkg/SmmControl2Dxe:
negotiate ICH9_LPC_SMI_F_CPU_HOTPLUG", 2020-08-24).

That's because we're in feature freeze for edk2-stable202008

  https://github.com/tianocore/tianocore.github.io/wiki/EDK-II-Release-Planning#edk2-stable202008-tag-planning

and we shouldn't make a release with ICH9_LPC_SMI_F_CPU_HOTPLUG
negotiated, knowing that there are races.


Please advise!

Thanks,
Laszlo
Igor Mammedov Aug. 26, 2020, 11:55 a.m. UTC | #4
On Wed, 26 Aug 2020 11:24:14 +0200
Laszlo Ersek <lersek@redhat.com> wrote:

> Hi Igor,
> 
> On 08/25/20 19:25, Laszlo Ersek wrote:
> 
> > So I would suggest fetching the CNEW array element back into "uid"
> > first, then using "uid" for both the NOTIFY call, and the (currently
> > missing) restoration of CSEL. Then we can write 1 to CINS.
> >
> > Expressed as a patch on top of yours:
> >  
> >> diff --git a/hw/acpi/cpu.c b/hw/acpi/cpu.c
> >> index 4864c3b39694..2bea6144fd5e 100644
> >> --- a/hw/acpi/cpu.c
> >> +++ b/hw/acpi/cpu.c
> >> @@ -564,8 +564,11 @@ void build_cpus_aml(Aml *table, MachineState *machine, CPUHotplugFeatures opts,
> >>              aml_append(method, aml_store(zero, cpu_idx));
> >>              while_ctx = aml_while(aml_lless(cpu_idx, num_added_cpus));
> >>              {
> >> -                aml_append(while_ctx, aml_call2(CPU_NOTIFY_METHOD,
> >> -                    aml_derefof(aml_index(new_cpus, cpu_idx)), dev_chk));
> >> +                aml_append(while_ctx,
> >> +                    aml_store(aml_derefof(aml_index(new_cpus, cpu_idx)), uid));
> >> +                aml_append(while_ctx,
> >> +                    aml_call2(CPU_NOTIFY_METHOD, uid, dev_chk));
> >> +                aml_append(while_ctx, aml_store(uid, cpu_selector));
> >>                  aml_append(while_ctx, aml_store(one, ins_evt));
> >>                  aml_append(while_ctx, aml_increment(cpu_idx));
> >>              }  
> >
> > This effects the following change, in the decompiled method:
> >  
> >> @@ -37,15 +37,17 @@
> >>      If ((Local_NumAddedCpus != Zero))
> >>      {
> >>          \_SB.PCI0.SMI0.SMIC = 0x04
> >>      }
> >>
> >>      Local_CpuIdx = Zero
> >>      While ((Local_CpuIdx < Local_NumAddedCpus))
> >>      {
> >> -        CTFY (DerefOf (CNEW [Local_CpuIdx]), One)
> >> +        Local_Uid = DerefOf (CNEW [Local_CpuIdx])
> >> +        CTFY (Local_Uid, One)
> >> +        \_SB.PCI0.PRES.CSEL = Local_Uid
> >>          \_SB.PCI0.PRES.CINS = One
> >>          Local_CpuIdx++
> >>      }
> >>
> >>      Release (\_SB.PCI0.PRES.CPLK)
> >>  }  
> >
> > With this change, the
> >
> >   virsh setvcpus DOMAIN 8 --live
> >
> > command works for me. The topology in my test domain has CPU#0 and
> > CPU#2 cold-plugged, so the command adds 6 VCPUs. Viewed from the
> > firmware side, the 6 "device_add" commands, issued in close succession
> > by libvirtd, coalesce into 4 "batches". (And of course the firmware
> > sees the 4 batches back-to-back.)  
> 
> unfortunately, with more testing, I have run into two more races:
> 
> (1) When a "device_add" occurs after the ACPI loop collects the CPUS
>     from the register block, but before the SMI.
> 
>     Here, the "stray CPU" is processed fine by the firmware. However,
>     the CTFY loop in ACPI does not know about the CPU, so it doesn't
>     clear the pending insert event for it. And when the firmware is
>     entered with an SMI for the *next* time, the firmware sees the same
>     CPU *again* as pending, and tries to relocate it again. Bad things
>     happen.
>
> (2) When a "device_add" occurs after the SMI, but before the firmware
>     collects the pending CPUs from the register block.
> 
>     Here, the firmware collects the "stray CPU". However, the "broadcast
>     SMI", with which we entered the firmware, did *not* cover the stray
>     CPU -- the CPU_FOREACH() loop in ich9_apm_ctrl_changed() could not
>     make the SMI pending for the new CPU, because at that time, the CPU
>     had not been added yet. As a result, when the firmware sends an
>     INIT-SIPI-SIPI to the new CPU, expecting it to boot right into SMM,
>     the new CPU instead boots straight into the post-RSM (normal mode)
>     "pen", skipping its initial SMI handler. Meaning that the CPU halts
>     nicely, but its SMBASE is never relocated, and the SMRAM message
>     exchange with the BSP falls apart.
> 
> Possible mitigations I can think of:
> 
> For problem (1):
> 
>   (1a) Change the firmware so it notices that it has relocated the
>        "stray" CPU before -- such CPUs should be simply skipped in the
>        firmware. The next time the CTFY loop runs in ACPI, it will clear
>        the pending event.
> 
>   (1b) Alternatively, stop consuming the hotplug register block in the
>        firmware altogether, and work out general message passing, from
>        ACPI to firmware. See the outline here:
> 
>          http://mid.mail-archive.com/cf887d74-f65d-602a-9629-3d25cef93a69@redhat.com
> 
> For problem (2):
> 
>   (2a) Change the firmware so that it sends a directed SMI as well to
>        each CPU, just before sending an INIT-SIPI-SIPI. This should be
>        idempotent -- if the broadcast SMI *has* covered the CPU,
>        then sending a directed SMI should make no difference.
may be still racy, as new cpus can arrive during/after direct broadcast.
better would be to ignore stray CPUs that didn't pass through SMM
relocation.
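
The "ignore stray CPUs" idea could be sketched roughly as below. This is a hypothetical model, not the actual OVMF code; `MAX_CPUS`, `relocated`, and `should_relocate` are invented names for illustration only.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MAX_CPUS 288  /* illustrative limit, not a QEMU/OVMF constant */

/*
 * Model of the suggestion above: the firmware remembers which APIC IDs
 * it has already taken through SMBASE relocation, so a CPU reported as
 * pending a second time (because ACPI could not clear its insert event
 * in time) is skipped instead of being relocated again.
 */
static bool relocated[MAX_CPUS];

static bool should_relocate(uint32_t apic_id)
{
    if (apic_id >= MAX_CPUS || relocated[apic_id]) {
        return false;          /* stray or already-relocated CPU */
    }
    relocated[apic_id] = true; /* record before the INIT-SIPI-SIPI */
    return true;
}
```

The second report of an already-relocated CPU then becomes a no-op; the next CTFY loop in ACPI clears the stale insert event.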

> 
>   (2b) Alternatively, change the "device_add" command in QEMU so that,
>        if "CPU hotplug with SMI" has been negotiated, the new CPU is
>        added with the SMI made pending for it at once. (That is, no
>        hot-plugged CPU would exist with the directed SMI *not* pending
>        for it.)
> 
>   (2c) Alternatively, approach (1b) would fix problem (2) as well -- the
>        firmware would only relocate such CPUs that ACPI collected before
>        injecting the SMI. So all those CPUs would have the SMI pending.
> 
> 
> I can experiment with (1a) and (2a), but note that for unplug, we're
> likely going to need the array passing method (1b) anyway. I'm concerned
> that (1a) and (2a) together may not fix all possible races (i.e. we
> might see more kinds).

Given the async nature of device_add, (1a) and (2a) are expected
(I had thought they were already handled by the firmware),
so I'd prefer to continue with the current approach if possible.
(If the firmware is not able to deal with (2a) without QEMU's help,
maybe we could add an extra flag to the register block,
though my preference is to avoid that if possible.)

(2b) If twisting CPU state in QEMU were acceptable, we would already
have a lot of helper quirks that put a CPU into whatever state a
particular piece of guest code desires. And why stop there: we could
discard the firmware, configure all devices right from QEMU, and
start executing OS code straight away (it would be faster than going
through what real hardware does). I'm afraid it's not going to fly.

(1c) It's still very unclear what direction to take there:
 * one, as you mentioned, would be introducing a new protocol for
   message passing between ACPI and firmware
   (so far I've been thinking that the payload should be the UID of the CPU);
 * another approach I was considering was adding an additional flag
   to the register block, so that CPU._EJ0 would mark the CPU as
   pending removal and do an SMI upcall to the firmware to act on it.
 
> What's important is that, *IF* we need to change anything in the
> firmware for fixing these races, then we either need to do so urgently,
> *or* I have to revert OVMF commit 5ba203b54e59 ("OvmfPkg/SmmControl2Dxe:
> negotiate ICH9_LPC_SMI_F_CPU_HOTPLUG", 2020-08-24).
> 
> That's because we're in feature freeze for edk2-stable202008
> 
>   https://github.com/tianocore/tianocore.github.io/wiki/EDK-II-Release-Planning#edk2-stable202008-tag-planning
> 
> and we shouldn't make a release with ICH9_LPC_SMI_F_CPU_HOTPLUG
> negotiated, knowing that there are races.
> 
> 
> Please advise!
> 
> Thanks,
> Laszlo
Laszlo Ersek Aug. 26, 2020, 1:32 p.m. UTC | #5
On 08/26/20 11:24, Laszlo Ersek wrote:
> Hi Igor,
> 
> On 08/25/20 19:25, Laszlo Ersek wrote:
> 
>> So I would suggest fetching the CNEW array element back into "uid"
>> first, then using "uid" for both the NOTIFY call, and the (currently
>> missing) restoration of CSEL. Then we can write 1 to CINS.
>>
>> Expressed as a patch on top of yours:
>>
>>> diff --git a/hw/acpi/cpu.c b/hw/acpi/cpu.c
>>> index 4864c3b39694..2bea6144fd5e 100644
>>> --- a/hw/acpi/cpu.c
>>> +++ b/hw/acpi/cpu.c
>>> @@ -564,8 +564,11 @@ void build_cpus_aml(Aml *table, MachineState *machine, CPUHotplugFeatures opts,
>>>              aml_append(method, aml_store(zero, cpu_idx));
>>>              while_ctx = aml_while(aml_lless(cpu_idx, num_added_cpus));
>>>              {
>>> -                aml_append(while_ctx, aml_call2(CPU_NOTIFY_METHOD,
>>> -                    aml_derefof(aml_index(new_cpus, cpu_idx)), dev_chk));
>>> +                aml_append(while_ctx,
>>> +                    aml_store(aml_derefof(aml_index(new_cpus, cpu_idx)), uid));
>>> +                aml_append(while_ctx,
>>> +                    aml_call2(CPU_NOTIFY_METHOD, uid, dev_chk));
>>> +                aml_append(while_ctx, aml_store(uid, cpu_selector));
>>>                  aml_append(while_ctx, aml_store(one, ins_evt));
>>>                  aml_append(while_ctx, aml_increment(cpu_idx));
>>>              }
>>
>> This effects the following change, in the decompiled method:
>>
>>> @@ -37,15 +37,17 @@
>>>      If ((Local_NumAddedCpus != Zero))
>>>      {
>>>          \_SB.PCI0.SMI0.SMIC = 0x04
>>>      }
>>>
>>>      Local_CpuIdx = Zero
>>>      While ((Local_CpuIdx < Local_NumAddedCpus))
>>>      {
>>> -        CTFY (DerefOf (CNEW [Local_CpuIdx]), One)
>>> +        Local_Uid = DerefOf (CNEW [Local_CpuIdx])
>>> +        CTFY (Local_Uid, One)
>>> +        \_SB.PCI0.PRES.CSEL = Local_Uid
>>>          \_SB.PCI0.PRES.CINS = One
>>>          Local_CpuIdx++
>>>      }
>>>
>>>      Release (\_SB.PCI0.PRES.CPLK)
>>>  }
>>
>> With this change, the
>>
>>   virsh setvcpus DOMAIN 8 --live
>>
>> command works for me. The topology in my test domain has CPU#0 and
>> CPU#2 cold-plugged, so the command adds 6 VCPUs. Viewed from the
>> firmware side, the 6 "device_add" commands, issued in close succession
>> by libvirtd, coalesce into 4 "batches". (And of course the firmware
>> sees the 4 batches back-to-back.)
> 
> unfortunately, with more testing, I have run into two more races:
> 
> (1) When a "device_add" occurs after the ACPI loop collects the CPUS
>     from the register block, but before the SMI.
> 
>     Here, the "stray CPU" is processed fine by the firmware. However,
>     the CTFY loop in ACPI does not know about the CPU, so it doesn't
>     clear the pending insert event for it. And when the firmware is
>     entered with an SMI for the *next* time, the firmware sees the same
>     CPU *again* as pending, and tries to relocate it again. Bad things
>     happen.
> 
> (2) When a "device_add" occurs after the SMI, but before the firmware
>     collects the pending CPUs from the register block.
> 
>     Here, the firmware collects the "stray CPU". However, the "broadcast
>     SMI", with which we entered the firmware, did *not* cover the stray
>     CPU -- the CPU_FOREACH() loop in ich9_apm_ctrl_changed() could not
>     make the SMI pending for the new CPU, because at that time, the CPU
>     had not been added yet. As a result, when the firmware sends an
>     INIT-SIPI-SIPI to the new CPU, expecting it to boot right into SMM,
>     the new CPU instead boots straight into the post-RSM (normal mode)
>     "pen", skipping its initial SMI handler. Meaning that the CPU halts
>     nicely, but its SMBASE is never relocated, and the SMRAM message
>     exchange with the BSP falls apart.
> 
> Possible mitigations I can think of:
> 
> For problem (1):
> 
>   (1a) Change the firmware so it notices that it has relocated the
>        "stray" CPU before -- such CPUs should be simply skipped in the
>        firmware. The next time the CTFY loop runs in ACPI, it will clear
>        the pending event.
> 
>   (1b) Alternatively, stop consuming the hotplug register block in the
>        firmware altogether, and work out general message passing, from
>        ACPI to firmware. See the outline here:
> 
>          http://mid.mail-archive.com/cf887d74-f65d-602a-9629-3d25cef93a69@redhat.com
> 
> For problem (2):
> 
>   (2a) Change the firmware so that it sends a directed SMI as well to
>        each CPU, just before sending an INIT-SIPI-SIPI. This should be
>        idempotent -- if the broadcast SMI *has* covered the CPU,
>        then sending a directed SMI should make no difference.
> 
>   (2b) Alternatively, change the "device_add" command in QEMU so that,
>        if "CPU hotplug with SMI" has been negotiated, the new CPU is
>        added with the SMI made pending for it at once. (That is, no
>        hot-plugged CPU would exist with the directed SMI *not* pending
>        for it.)
> 
>   (2c) Alternatively, approach (1b) would fix problem (2) as well -- the
>        firmware would only relocate such CPUs that ACPI collected before
>        injecting the SMI. So all those CPUs would have the SMI pending.
> 
> 
> I can experiment with (1a) and (2a),

My patches for (1a) and (1b) seem to work -- my workstation has 10
PCPUs, and I'm using a guest with 20 possible VCPUs and 2 cold-plugged
VCPUs on it, for testing. The patches survive the hot-plugging of 18
VCPUs in one go, or two batches like 9+9. I can see the fixes being
exercised.

Unless you strongly disagree (or I find issues in further testing), I
propose that I post these fixes to edk2-devel (they should still be in
scope for the upcoming release), and that we stick with your current
patch series for QEMU (v3 -- upcoming, or maybe already posted).

Thanks!
Laszlo
Laszlo Ersek Aug. 26, 2020, 1:32 p.m. UTC | #6
On 08/26/20 15:32, Laszlo Ersek wrote:
> On 08/26/20 11:24, Laszlo Ersek wrote:
>> Hi Igor,
>>
>> On 08/25/20 19:25, Laszlo Ersek wrote:
>>
>>> So I would suggest fetching the CNEW array element back into "uid"
>>> first, then using "uid" for both the NOTIFY call, and the (currently
>>> missing) restoration of CSEL. Then we can write 1 to CINS.
>>>
>>> Expressed as a patch on top of yours:
>>>
>>>> diff --git a/hw/acpi/cpu.c b/hw/acpi/cpu.c
>>>> index 4864c3b39694..2bea6144fd5e 100644
>>>> --- a/hw/acpi/cpu.c
>>>> +++ b/hw/acpi/cpu.c
>>>> @@ -564,8 +564,11 @@ void build_cpus_aml(Aml *table, MachineState *machine, CPUHotplugFeatures opts,
>>>>              aml_append(method, aml_store(zero, cpu_idx));
>>>>              while_ctx = aml_while(aml_lless(cpu_idx, num_added_cpus));
>>>>              {
>>>> -                aml_append(while_ctx, aml_call2(CPU_NOTIFY_METHOD,
>>>> -                    aml_derefof(aml_index(new_cpus, cpu_idx)), dev_chk));
>>>> +                aml_append(while_ctx,
>>>> +                    aml_store(aml_derefof(aml_index(new_cpus, cpu_idx)), uid));
>>>> +                aml_append(while_ctx,
>>>> +                    aml_call2(CPU_NOTIFY_METHOD, uid, dev_chk));
>>>> +                aml_append(while_ctx, aml_store(uid, cpu_selector));
>>>>                  aml_append(while_ctx, aml_store(one, ins_evt));
>>>>                  aml_append(while_ctx, aml_increment(cpu_idx));
>>>>              }
>>>
>>> This effects the following change, in the decompiled method:
>>>
>>>> @@ -37,15 +37,17 @@
>>>>      If ((Local_NumAddedCpus != Zero))
>>>>      {
>>>>          \_SB.PCI0.SMI0.SMIC = 0x04
>>>>      }
>>>>
>>>>      Local_CpuIdx = Zero
>>>>      While ((Local_CpuIdx < Local_NumAddedCpus))
>>>>      {
>>>> -        CTFY (DerefOf (CNEW [Local_CpuIdx]), One)
>>>> +        Local_Uid = DerefOf (CNEW [Local_CpuIdx])
>>>> +        CTFY (Local_Uid, One)
>>>> +        \_SB.PCI0.PRES.CSEL = Local_Uid
>>>>          \_SB.PCI0.PRES.CINS = One
>>>>          Local_CpuIdx++
>>>>      }
>>>>
>>>>      Release (\_SB.PCI0.PRES.CPLK)
>>>>  }
>>>
>>> With this change, the
>>>
>>>   virsh setvcpus DOMAIN 8 --live
>>>
>>> command works for me. The topology in my test domain has CPU#0 and
>>> CPU#2 cold-plugged, so the command adds 6 VCPUs. Viewed from the
>>> firmware side, the 6 "device_add" commands, issued in close succession
>>> by libvirtd, coalesce into 4 "batches". (And of course the firmware
>>> sees the 4 batches back-to-back.)
>>
>> unfortunately, with more testing, I have run into two more races:
>>
>> (1) When a "device_add" occurs after the ACPI loop collects the CPUS
>>     from the register block, but before the SMI.
>>
>>     Here, the "stray CPU" is processed fine by the firmware. However,
>>     the CTFY loop in ACPI does not know about the CPU, so it doesn't
>>     clear the pending insert event for it. And when the firmware is
>>     entered with an SMI for the *next* time, the firmware sees the same
>>     CPU *again* as pending, and tries to relocate it again. Bad things
>>     happen.
>>
>> (2) When a "device_add" occurs after the SMI, but before the firmware
>>     collects the pending CPUs from the register block.
>>
>>     Here, the firmware collects the "stray CPU". However, the "broadcast
>>     SMI", with which we entered the firmware, did *not* cover the stray
>>     CPU -- the CPU_FOREACH() loop in ich9_apm_ctrl_changed() could not
>>     make the SMI pending for the new CPU, because at that time, the CPU
>>     had not been added yet. As a result, when the firmware sends an
>>     INIT-SIPI-SIPI to the new CPU, expecting it to boot right into SMM,
>>     the new CPU instead boots straight into the post-RSM (normal mode)
>>     "pen", skipping its initial SMI handler. Meaning that the CPU halts
>>     nicely, but its SMBASE is never relocated, and the SMRAM message
>>     exchange with the BSP falls apart.
>>
>> Possible mitigations I can think of:
>>
>> For problem (1):
>>
>>   (1a) Change the firmware so it notices that it has relocated the
>>        "stray" CPU before -- such CPUs should be simply skipped in the
>>        firmware. The next time the CTFY loop runs in ACPI, it will clear
>>        the pending event.
>>
>>   (1b) Alternatively, stop consuming the hotplug register block in the
>>        firmware altogether, and work out general message passing, from
>>        ACPI to firmware. See the outline here:
>>
>>          http://mid.mail-archive.com/cf887d74-f65d-602a-9629-3d25cef93a69@redhat.com
>>
>> For problem (2):
>>
>>   (2a) Change the firmware so that it sends a directed SMI as well to
>>        each CPU, just before sending an INIT-SIPI-SIPI. This should be
>>        idempotent -- if the broadcast SMI *has* covered the CPU,
>>        then sending a directed SMI should make no difference.
>>
>>   (2b) Alternatively, change the "device_add" command in QEMU so that,
>>        if "CPU hotplug with SMI" has been negotiated, the new CPU is
>>        added with the SMI made pending for it at once. (That is, no
>>        hot-plugged CPU would exist with the directed SMI *not* pending
>>        for it.)
>>
>>   (2c) Alternatively, approach (1b) would fix problem (2) as well -- the
>>        firmware would only relocate such CPUs that ACPI collected before
>>        injecting the SMI. So all those CPUs would have the SMI pending.
>>
>>
>> I can experiment with (1a) and (2a),
> 
> My patches for (1a) and (1b) seem to work -- my workstation has 10

aargh, I meant my patches for (1a) and (2a).

sorry
Laszlo

> PCPUs, and I'm using a guest with 20 possible VCPUs and 2 cold-plugged
> VCPUs on it, for testing. The patches survive the hot-plugging of 18
> VCPUs in one go, or two batches like 9+9. I can see the fixes being
> exercised.
> 
> Unless you strongly disagree (or I find issues in further testing), I
> propose that I post these fixes to edk2-devel (they should still be in
> scope for the upcoming release), and that we stick with your current
> patch series for QEMU (v3 -- upcoming, or maybe already posted).
> 
> Thanks!
> Laszlo
>
Igor Mammedov Aug. 26, 2020, 2:10 p.m. UTC | #7
On Wed, 26 Aug 2020 15:32:07 +0200
Laszlo Ersek <lersek@redhat.com> wrote:

> On 08/26/20 11:24, Laszlo Ersek wrote:
> > Hi Igor,
> > 
> > On 08/25/20 19:25, Laszlo Ersek wrote:
> >   
> >> So I would suggest fetching the CNEW array element back into "uid"
> >> first, then using "uid" for both the NOTIFY call, and the (currently
> >> missing) restoration of CSEL. Then we can write 1 to CINS.
> >>
> >> Expressed as a patch on top of yours:
> >>  
> >>> diff --git a/hw/acpi/cpu.c b/hw/acpi/cpu.c
> >>> index 4864c3b39694..2bea6144fd5e 100644
> >>> --- a/hw/acpi/cpu.c
> >>> +++ b/hw/acpi/cpu.c
> >>> @@ -564,8 +564,11 @@ void build_cpus_aml(Aml *table, MachineState *machine, CPUHotplugFeatures opts,
> >>>              aml_append(method, aml_store(zero, cpu_idx));
> >>>              while_ctx = aml_while(aml_lless(cpu_idx, num_added_cpus));
> >>>              {
> >>> -                aml_append(while_ctx, aml_call2(CPU_NOTIFY_METHOD,
> >>> -                    aml_derefof(aml_index(new_cpus, cpu_idx)), dev_chk));
> >>> +                aml_append(while_ctx,
> >>> +                    aml_store(aml_derefof(aml_index(new_cpus, cpu_idx)), uid));
> >>> +                aml_append(while_ctx,
> >>> +                    aml_call2(CPU_NOTIFY_METHOD, uid, dev_chk));
> >>> +                aml_append(while_ctx, aml_store(uid, cpu_selector));
> >>>                  aml_append(while_ctx, aml_store(one, ins_evt));
> >>>                  aml_append(while_ctx, aml_increment(cpu_idx));
> >>>              }  
> >>
> >> This effects the following change, in the decompiled method:
> >>  
> >>> @@ -37,15 +37,17 @@
> >>>      If ((Local_NumAddedCpus != Zero))
> >>>      {
> >>>          \_SB.PCI0.SMI0.SMIC = 0x04
> >>>      }
> >>>
> >>>      Local_CpuIdx = Zero
> >>>      While ((Local_CpuIdx < Local_NumAddedCpus))
> >>>      {
> >>> -        CTFY (DerefOf (CNEW [Local_CpuIdx]), One)
> >>> +        Local_Uid = DerefOf (CNEW [Local_CpuIdx])
> >>> +        CTFY (Local_Uid, One)
> >>> +        \_SB.PCI0.PRES.CSEL = Local_Uid
> >>>          \_SB.PCI0.PRES.CINS = One
> >>>          Local_CpuIdx++
> >>>      }
> >>>
> >>>      Release (\_SB.PCI0.PRES.CPLK)
> >>>  }  
> >>
> >> With this change, the
> >>
> >>   virsh setvcpus DOMAIN 8 --live
> >>
> >> command works for me. The topology in my test domain has CPU#0 and
> >> CPU#2 cold-plugged, so the command adds 6 VCPUs. Viewed from the
> >> firmware side, the 6 "device_add" commands, issued in close succession
> >> by libvirtd, coalesce into 4 "batches". (And of course the firmware
> >> sees the 4 batches back-to-back.)  
> > 
> > unfortunately, with more testing, I have run into two more races:
> > 
> > (1) When a "device_add" occurs after the ACPI loop collects the CPUS
> >     from the register block, but before the SMI.
> > 
> >     Here, the "stray CPU" is processed fine by the firmware. However,
> >     the CTFY loop in ACPI does not know about the CPU, so it doesn't
> >     clear the pending insert event for it. And when the firmware is
> >     entered with an SMI for the *next* time, the firmware sees the same
> >     CPU *again* as pending, and tries to relocate it again. Bad things
> >     happen.
> > 
> > (2) When a "device_add" occurs after the SMI, but before the firmware
> >     collects the pending CPUs from the register block.
> > 
> >     Here, the firmware collects the "stray CPU". However, the "broadcast
> >     SMI", with which we entered the firmware, did *not* cover the stray
> >     CPU -- the CPU_FOREACH() loop in ich9_apm_ctrl_changed() could not
> >     make the SMI pending for the new CPU, because at that time, the CPU
> >     had not been added yet. As a result, when the firmware sends an
> >     INIT-SIPI-SIPI to the new CPU, expecting it to boot right into SMM,
> >     the new CPU instead boots straight into the post-RSM (normal mode)
> >     "pen", skipping its initial SMI handler. Meaning that the CPU halts
> >     nicely, but its SMBASE is never relocated, and the SMRAM message
> >     exchange with the BSP falls apart.
> > 
> > Possible mitigations I can think of:
> > 
> > For problem (1):
> > 
> >   (1a) Change the firmware so it notices that it has relocated the
> >        "stray" CPU before -- such CPUs should be simply skipped in the
> >        firmware. The next time the CTFY loop runs in ACPI, it will clear
> >        the pending event.
> > 
> >   (1b) Alternatively, stop consuming the hotplug register block in the
> >        firmware altogether, and work out general message passing, from
> >        ACPI to firmware. See the outline here:
> > 
> >          http://mid.mail-archive.com/cf887d74-f65d-602a-9629-3d25cef93a69@redhat.com
> > 
> > For problem (2):
> > 
> >   (2a) Change the firmware so that it sends a directed SMI as well to
> >        each CPU, just before sending an INIT-SIPI-SIPI. This should be
> >        idempotent -- if the broadcast SMI *has* covered the CPU,
> >        then sending a directed SMI should make no difference.
> > 
> >   (2b) Alternatively, change the "device_add" command in QEMU so that,
> >        if "CPU hotplug with SMI" has been negotiated, the new CPU is
> >        added with the SMI made pending for it at once. (That is, no
> >        hot-plugged CPU would exist with the directed SMI *not* pending
> >        for it.)
> > 
> >   (2c) Alternatively, approach (1b) would fix problem (2) as well -- the
> >        firmware would only relocate such CPUs that ACPI collected before
> >        injecting the SMI. So all those CPUs would have the SMI pending.
> > 
> > 
> > I can experiment with (1a) and (2a),  
> 
> My patches for (1a) and (1b) seem to work -- my workstation has 10
> PCPUs, and I'm using a guest with 20 possible VCPUs and 2 cold-plugged
> VCPUs on it, for testing. The patches survive the hot-plugging of 18
> VCPUs in one go, or two batches like 9+9. I can see the fixes being
> exercised.
> 
> Unless you strongly disagree (or I find issues in further testing), I
> propose that I post these fixes to edk2-devel (they should still be in
> scope for the upcoming release), and that we stick with your current
> patch series for QEMU (v3 -- upcoming, or maybe already posted).

Let's do as you suggest.
As for the QEMU side, I'll try to post the next revision next week.

> 
> Thanks!
> Laszlo
> 
>
Laszlo Ersek Aug. 26, 2020, 3:12 p.m. UTC | #8
On 08/26/20 13:55, Igor Mammedov wrote:
> On Wed, 26 Aug 2020 11:24:14 +0200
> Laszlo Ersek <lersek@redhat.com> wrote:

>>   (2a) Change the firmware so that it sends a directed SMI as well to
>>        each CPU, just before sending an INIT-SIPI-SIPI. This should be
>>        idempotent -- if the broadcast SMI *has* covered the CPU,
>>        then sending a directed SMI should make no difference.
> may be still racy, as new cpus can arrive during/after direct broadcast.

(I think you meant "direct SMI")

That's not a problem -- the point is that we must never send
INIT-SIPI-SIPI to a hot-added CPU without making an SMI pending for it.

The above condition can be satisfied by not sending INIT-SIPI-SIPI to a
VCPU at all.

The firmware collects pending CPUs into an array, and then does the
directed SMI + INIT-SIPI-SIPI dance for each, in a separate loop.

So if a new VCPU is hot-added while we are sending the interrupts to the
already collected ones, that's fine -- we're not going to send *either*
SMI *or* INIT-SIPI-SIPI to that VCPU, until the next time we collect VCPUs.
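
The collect-first, signal-later ordering described above can be modeled minimally as follows. This is a sketch with invented names (`device_add_model`, `collect`, `signal_collected`), not the real firmware loop; it only demonstrates that a VCPU hot-added between the two phases receives neither interrupt.

```c
#include <stdint.h>

#define MAX_CPUS 64  /* illustrative */

static int pending[MAX_CPUS];   /* set by "device_add" in this model  */
static int signaled[MAX_CPUS];  /* directed SMI + INIT-SIPI-SIPI sent */

/* model of QEMU hot-adding a VCPU */
static void device_add_model(uint32_t uid)
{
    pending[uid] = 1;
}

/* phase 1: snapshot the pending VCPUs into an array */
static int collect(uint32_t *out)
{
    int n = 0;

    for (uint32_t i = 0; i < MAX_CPUS; i++) {
        if (pending[i]) {
            out[n++] = i;
        }
    }
    return n;
}

/* phase 2: only the snapshot gets the directed SMI + INIT-SIPI-SIPI */
static void signal_collected(const uint32_t *uids, int n)
{
    for (int i = 0; i < n; i++) {
        signaled[uids[i]] = 1;
    }
}
```

A VCPU added after `collect()` but before `signal_collected()` simply waits for the next collection round, which is the property that avoids sending INIT-SIPI-SIPI to a CPU without an SMI pending.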

It's basically the same idea as in your ACPI patch for QEMU.

I'll send the OVMF patches soon.

Thanks!
Laszlo

Patch

diff --git a/include/hw/acpi/cpu.h b/include/hw/acpi/cpu.h
index 62f0278ba2..0eeedaa491 100644
--- a/include/hw/acpi/cpu.h
+++ b/include/hw/acpi/cpu.h
@@ -50,6 +50,7 @@  void cpu_hotplug_hw_init(MemoryRegion *as, Object *owner,
 typedef struct CPUHotplugFeatures {
     bool acpi_1_compatible;
     bool has_legacy_cphp;
+    const char *smi_path;
 } CPUHotplugFeatures;
 
 void build_cpus_aml(Aml *table, MachineState *machine, CPUHotplugFeatures opts,
diff --git a/include/hw/i386/ich9.h b/include/hw/i386/ich9.h
index d1bb3f7bf0..0f43ef2481 100644
--- a/include/hw/i386/ich9.h
+++ b/include/hw/i386/ich9.h
@@ -245,6 +245,8 @@  typedef struct ICH9LPCState {
 #define ICH9_SMB_HST_D1                         0x06
 #define ICH9_SMB_HOST_BLOCK_DB                  0x07
 
+#define ICH9_LPC_SMI_NEGOTIATED_FEAT_PROP "x-smi-negotiated-features"
+
 /* bit positions used in fw_cfg SMI feature negotiation */
 #define ICH9_LPC_SMI_F_BROADCAST_BIT            0
 #define ICH9_LPC_SMI_F_CPU_HOTPLUG_BIT          1
diff --git a/hw/acpi/cpu.c b/hw/acpi/cpu.c
index 3d6a500fb7..4864c3b396 100644
--- a/hw/acpi/cpu.c
+++ b/hw/acpi/cpu.c
@@ -14,6 +14,8 @@ 
 #define ACPI_CPU_CMD_DATA_OFFSET_RW 8
 #define ACPI_CPU_CMD_DATA2_OFFSET_R 0
 
+#define OVMF_CPUHP_SMI_CMD 4
+
 enum {
     CPHP_GET_NEXT_CPU_WITH_EVENT_CMD = 0,
     CPHP_OST_EVENT_CMD = 1,
@@ -321,6 +323,7 @@  const VMStateDescription vmstate_cpu_hotplug = {
 #define CPU_NOTIFY_METHOD "CTFY"
 #define CPU_EJECT_METHOD  "CEJ0"
 #define CPU_OST_METHOD    "COST"
+#define CPU_ADDED_LIST    "CNEW"
 
 #define CPU_ENABLED       "CPEN"
 #define CPU_SELECTOR      "CSEL"
@@ -471,21 +474,61 @@  void build_cpus_aml(Aml *table, MachineState *machine, CPUHotplugFeatures opts,
             Aml *dev_chk = aml_int(1);
             Aml *eject_req = aml_int(3);
             Aml *next_cpu_cmd = aml_int(CPHP_GET_NEXT_CPU_WITH_EVENT_CMD);
+            Aml *num_added_cpus = aml_local(1);
+            Aml *cpu_idx = aml_local(2);
+            Aml *uid = aml_local(3);
+            Aml *new_cpus = aml_name(CPU_ADDED_LIST);
 
             aml_append(method, aml_acquire(ctrl_lock, 0xFFFF));
+
+            /* use named package as old Windows don't support it in local var */
+            if (arch_ids->len <= 256) {
+                /*
+                 * For compatibility with Windows Server 2008 (max 256 cpus),
+                 * use ACPI1.0 PackageOp which can cache max 255 elements.
+                 * Which is fine for 256 CPUs VM. Max 255 can be hotplugged,
+                 * sice at least one CPU must be already present.
+                 */
+                aml_append(method, aml_name_decl(CPU_ADDED_LIST,
+                    aml_package(arch_ids->len)));
+            } else {
+                aml_append(method, aml_name_decl(CPU_ADDED_LIST,
+                    aml_varpackage(arch_ids->len)));
+            }
+
+            aml_append(method, aml_store(zero, num_added_cpus));
             aml_append(method, aml_store(one, has_event));
-            while_ctx = aml_while(aml_equal(has_event, one));
+            /* start scan from the 1st CPU */
+            aml_append(method, aml_store(zero, uid));
+            while_ctx = aml_while(aml_land(aml_equal(has_event, one),
+                                  aml_lless(uid, aml_int(arch_ids->len))));
             {
                  /* clear loop exit condition, ins_evt/rm_evt checks
                   * will set it to 1 while next_cpu_cmd returns a CPU
                   * with events */
                  aml_append(while_ctx, aml_store(zero, has_event));
+
+                 aml_append(while_ctx, aml_store(uid, cpu_selector));
                  aml_append(while_ctx, aml_store(next_cpu_cmd, cpu_cmd));
+
+                 /*
+                  * wrap around case, scan is complete, exit loop.
+                  * It happens since events are not cleared in scan loop,
+                  * so next_cpu_cmd continues to find already processed CPUs.
+                  */
+                 ifctx = aml_if(aml_lless(cpu_data, uid));
+                 {
+                     aml_append(ifctx, aml_break());
+                 }
+                 aml_append(while_ctx, ifctx);
+
+                 aml_append(while_ctx, aml_store(cpu_data, uid));
                  ifctx = aml_if(aml_equal(ins_evt, one));
                  {
-                     aml_append(ifctx,
-                         aml_call2(CPU_NOTIFY_METHOD, cpu_data, dev_chk));
-                     aml_append(ifctx, aml_store(one, ins_evt));
+                     /* cache added CPUs to Notify/Wakeup later */
+                     aml_append(ifctx, aml_store(uid,
+                         aml_index(new_cpus, num_added_cpus)));
+                     aml_append(ifctx, aml_increment(num_added_cpus));
                      aml_append(ifctx, aml_store(one, has_event));
                  }
                  aml_append(while_ctx, ifctx);
@@ -493,14 +536,41 @@  void build_cpus_aml(Aml *table, MachineState *machine, CPUHotplugFeatures opts,
                  ifctx = aml_if(aml_equal(rm_evt, one));
                  {
                      aml_append(ifctx,
-                         aml_call2(CPU_NOTIFY_METHOD, cpu_data, eject_req));
+                         aml_call2(CPU_NOTIFY_METHOD, uid, eject_req));
                      aml_append(ifctx, aml_store(one, rm_evt));
                      aml_append(ifctx, aml_store(one, has_event));
                  }
                  aml_append(else_ctx, ifctx);
                  aml_append(while_ctx, else_ctx);
+                 aml_append(while_ctx, aml_increment(uid));
+            }
+            aml_append(method, while_ctx);
+
+            /*
+             * If the FW negotiated ICH9_LPC_SMI_F_CPU_HOTPLUG_BIT,
+             * upcall to the FW so it can pull in the new CPUs before
+             * the OS is notified and wakes them up.
+             */
+            if (opts.smi_path) {
+                ifctx = aml_if(aml_lnot(aml_equal(num_added_cpus, zero)));
+                {
+                    aml_append(ifctx, aml_store(aml_int(OVMF_CPUHP_SMI_CMD),
+                        aml_name("%s", opts.smi_path)));
+                }
+                aml_append(method, ifctx);
+            }
+
+            /* Notify OSPM about new CPUs and clear insert events */
+            aml_append(method, aml_store(zero, cpu_idx));
+            while_ctx = aml_while(aml_lless(cpu_idx, num_added_cpus));
+            {
+                aml_append(while_ctx, aml_call2(CPU_NOTIFY_METHOD,
+                    aml_derefof(aml_index(new_cpus, cpu_idx)), dev_chk));
+                aml_append(while_ctx, aml_store(one, ins_evt));
+                aml_append(while_ctx, aml_increment(cpu_idx));
             }
             aml_append(method, while_ctx);
+
             aml_append(method, aml_release(ctrl_lock));
         }
         aml_append(cpus_dev, method);
diff --git a/hw/i386/acpi-build.c b/hw/i386/acpi-build.c
index b7bcbbbb2a..2291c050ba 100644
--- a/hw/i386/acpi-build.c
+++ b/hw/i386/acpi-build.c
@@ -95,6 +95,7 @@  typedef struct AcpiPmInfo {
     bool s3_disabled;
     bool s4_disabled;
     bool pcihp_bridge_en;
+    bool smi_on_cpuhp;
     uint8_t s4_val;
     AcpiFadtData fadt;
     uint16_t cpu_hp_io_base;
@@ -194,6 +195,7 @@  static void acpi_get_pm_info(MachineState *machine, AcpiPmInfo *pm)
     pm->cpu_hp_io_base = 0;
     pm->pcihp_io_base = 0;
     pm->pcihp_io_len = 0;
+    pm->smi_on_cpuhp = false;
 
     assert(obj);
     init_common_fadt_data(machine, obj, &pm->fadt);
@@ -207,12 +209,16 @@  static void acpi_get_pm_info(MachineState *machine, AcpiPmInfo *pm)
             object_property_get_uint(obj, ACPI_PCIHP_IO_LEN_PROP, NULL);
     }
     if (lpc) {
+        uint64_t smi_features = object_property_get_uint(lpc,
+            ICH9_LPC_SMI_NEGOTIATED_FEAT_PROP, NULL);
         struct AcpiGenericAddress r = { .space_id = AML_AS_SYSTEM_IO,
             .bit_width = 8, .address = ICH9_RST_CNT_IOPORT };
         pm->fadt.reset_reg = r;
         pm->fadt.reset_val = 0xf;
         pm->fadt.flags |= 1 << ACPI_FADT_F_RESET_REG_SUP;
         pm->cpu_hp_io_base = ICH9_CPU_HOTPLUG_IO_BASE;
+        pm->smi_on_cpuhp =
+            !!(smi_features & BIT_ULL(ICH9_LPC_SMI_F_CPU_HOTPLUG_BIT));
     }
 
     /* The above need not be conditional on machine type because the reset port
@@ -1515,6 +1521,32 @@  build_dsdt(GArray *table_data, BIOSLinker *linker,
         aml_append(dev, aml_name_decl("_UID", aml_int(1)));
         aml_append(dev, build_q35_osc_method());
         aml_append(sb_scope, dev);
+
+        if (pm->smi_on_cpuhp) {
+            /* reserve SMI block resources, IO ports 0xB2, 0xB3 */
+            dev = aml_device("PCI0.SMI0");
+            aml_append(dev, aml_name_decl("_HID", aml_eisaid("PNP0A06")));
+            aml_append(dev, aml_name_decl("_UID", aml_string("SMI resources")));
+            crs = aml_resource_template();
+            aml_append(crs,
+                aml_io(
+                       AML_DECODE16,
+                       ACPI_PORT_SMI_CMD,
+                       ACPI_PORT_SMI_CMD,
+                       1,
+                       2)
+            );
+            aml_append(dev, aml_name_decl("_CRS", crs));
+            aml_append(dev, aml_operation_region("SMIR", AML_SYSTEM_IO,
+                aml_int(ACPI_PORT_SMI_CMD), 2));
+            field = aml_field("SMIR", AML_BYTE_ACC, AML_NOLOCK,
+                              AML_WRITE_AS_ZEROS);
+            aml_append(field, aml_named_field("SMIC", 8));
+            aml_append(field, aml_reserved_field(8));
+            aml_append(dev, field);
+            aml_append(sb_scope, dev);
+        }
+
         aml_append(dsdt, sb_scope);
 
         build_hpet_aml(dsdt);
@@ -1530,7 +1562,8 @@  build_dsdt(GArray *table_data, BIOSLinker *linker,
         build_legacy_cpu_hotplug_aml(dsdt, machine, pm->cpu_hp_io_base);
     } else {
         CPUHotplugFeatures opts = {
-            .acpi_1_compatible = true, .has_legacy_cphp = true
+            .acpi_1_compatible = true, .has_legacy_cphp = true,
+            .smi_path = pm->smi_on_cpuhp ? "\\_SB.PCI0.SMI0.SMIC" : NULL,
         };
         build_cpus_aml(dsdt, machine, opts, pm->cpu_hp_io_base,
                        "\\_SB.PCI0", "\\_GPE._E02");
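For readers who don't decode `aml_*` builders on sight, the SMI0 device added above should decompile to roughly the following ASL. This is a sketch reconstructed from the builder calls (assuming `ACPI_PORT_SMI_CMD` is 0xB2, per the "IO ports 0xB2, 0xB3" comment), not actual iasl output:

```
Device (SMI0)                       // under \_SB.PCI0
{
    Name (_HID, EisaId ("PNP0A06"))
    Name (_UID, "SMI resources")
    Name (_CRS, ResourceTemplate ()
    {
        IO (Decode16, 0x00B2, 0x00B2, 0x01, 0x02)   // 0xB2..0xB3
    })
    OperationRegion (SMIR, SystemIO, 0x00B2, 0x02)
    Field (SMIR, ByteAcc, NoLock, WriteAsZeros)
    {
        SMIC,   8,                  // written by CSCN via \_SB.PCI0.SMI0.SMIC
        ,       8                   // 0xB3, reserved
    }
}
```

Reserving both ports in `_CRS` while the operation region covers the same two bytes keeps the OSPM from assigning 0xB2/0xB3 to anything else.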
diff --git a/hw/isa/lpc_ich9.c b/hw/isa/lpc_ich9.c
index 19f32bed3e..8124d20338 100644
--- a/hw/isa/lpc_ich9.c
+++ b/hw/isa/lpc_ich9.c
@@ -647,6 +647,9 @@  static void ich9_lpc_initfn(Object *obj)
                                   &acpi_enable_cmd, OBJ_PROP_FLAG_READ);
     object_property_add_uint8_ptr(OBJECT(lpc), ACPI_PM_PROP_ACPI_DISABLE_CMD,
                                   &acpi_disable_cmd, OBJ_PROP_FLAG_READ);
+    object_property_add_uint64_ptr(obj, ICH9_LPC_SMI_NEGOTIATED_FEAT_PROP,
+                                   &lpc->smi_negotiated_features,
+                                   OBJ_PROP_FLAG_READ);
 
     ich9_pm_add_properties(obj, &lpc->pm);
 }