diff mbox

[v5,05/10] ACPI: Add Virtual Machine Generation ID support

Message ID ffba05c5-0b4e-e2a6-2260-4f0da0f53e02@redhat.com
State New
Headers show

Commit Message

Laszlo Ersek Feb. 8, 2017, 11:43 p.m. UTC
On 02/08/17 23:34, Ben Warren wrote:
> 
>> On Feb 7, 2017, at 4:48 PM, Laszlo Ersek <lersek@redhat.com
>> <mailto:lersek@redhat.com>> wrote:
>>
>> On 02/05/17 10:12, ben@skyportsystems.com
>> <mailto:ben@skyportsystems.com> wrote:
>>> From: Ben Warren <ben@skyportsystems.com <mailto:ben@skyportsystems.com>>
>>>
>>> This implements the VM Generation ID feature by passing a 128-bit
>>> GUID to the guest via a fw_cfg blob.
>>> Any time the GUID changes, an ACPI notify event is sent to the guest
>>>
>>> The user interface is a simple device with one parameter:
>>> - guid (string, must be "auto" or in UUID format
>>>   xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx)
>>>
>>> Signed-off-by: Ben Warren <ben@skyportsystems.com
>>> <mailto:ben@skyportsystems.com>>
>>> ---
>>> default-configs/i386-softmmu.mak     |   1 +
>>> default-configs/x86_64-softmmu.mak   |   1 +
>>> hw/acpi/Makefile.objs                |   1 +
>>> hw/acpi/vmgenid.c                    | 206
>>> +++++++++++++++++++++++++++++++++++
>>> hw/i386/acpi-build.c                 |  10 ++
>>> include/hw/acpi/acpi_dev_interface.h |   1 +
>>> include/hw/acpi/vmgenid.h            |  37 +++++++
>>> 7 files changed, 257 insertions(+)
>>> create mode 100644 hw/acpi/vmgenid.c
>>> create mode 100644 include/hw/acpi/vmgenid.h
>>
>> [snip code]
>>
>> So, I'm late to the game. I can't possibly comment on all the concerns
>> that have been raised scattered around the thread, exactly *where* they
>> have been raised. However, I will try to collect the concerns here.
>>
>>
>> (1) Byte swapping for the UUID.
>>
>> The idea of QemuUUID is that wherever it is stored in a non-ephemeral
>> fashion, it is in BE representation. The guest needs it in LE, so we
>> should convert it temporarily -- using a local variable -- just before
>> writing it to guest memory, and then forget about the LE representation
>> promptly.
>>
>> As far as I understand, we all agree on this (Michael, Ben, myself -- I
>> think Igor hasn't commented on this).
>>
>>
> Sure, will rework.
>> (2) Structure of the vmgenid fw_cfg blob, AKA "why that darn offset 40
>> decimal".
>>
>> I explained it under points (6) and (7) in the following message:
>>
>> Message-Id: <c16a03d5-820a-f719-81ea-43858f903395@redhat.com
>> <mailto:c16a03d5-820a-f719-81ea-43858f903395@redhat.com>>
>> URL: https://www.mail-archive.com/qemu-devel@nongnu.org/msg425226.html
>>
>> The story is that *wherever* an ADD_POINTER command points to -- that
>> is, at the *exact* target address --, OVMF will look for an ACPI table
>> header, somewhat heuristically. If it finds a byte pattern that (a) fits
>> into the remaining blob and (b) passes some superficial ACPI table
>> header checks, such as length and checksum, then OVMF assumes that the
>> blob contains an ACPI table there, and passes the address (again, the
>> exact, relocated, absolute target address of ADD_POINTER) to
>> EFI_ACPI_TABLE_PROTOCOL.InstallAcpiTable().
>>
>> We want to disable this heuristic for the vmgenid blob. *If* the blob
>> contained only 16 bytes (for the GUID), then the heuristic would
>> automatically disable itself, because the ACPI table header (36 bytes)
>> is larger than 16 bytes, so OVMF wouldn't even look. However, for the
>> caching and other VMGENID requirements, we need to allocate a full page
>> with the blob (4KB), hence OVMF *will* look. If we placed the GUID right
>> at the start of the page, then OVMF would sanity-check it as an ACPI
>> table header. The check would *most likely* not pass, so things would be
>> fine in practice, but we can do better than that: just put 40 zero bytes
>> at the front of the blob.
>>
>> And this is why the ADD_POINTER command has to point to the beginning of
>> the blob (i.e., 36 zero bytes), to "suppress the OVMF ACPI SDT header
>> detection". (The other 4 bytes are for arriving at an address divisible
>> by 8, which is a VMGENID requirement for the GUID.)
>>
>> The consequence is that both the ADDR method and QEMU's guest memory
>> access code have to add 40 manually.
>>
> Igor has recommended that I have the add_pointer() call that patches AML
> have an offset of 40 from the start of the source file, which will
> result in the VGIA AML variable pointing to the GUID, not offset by 40.
>  I assume this isn’t a problem for you, but please confirm.

No, that would be a problem. The explanation as to why is right above.

Please process the rest of this sub-thread; Igor has agreed that my
request makes sense, he'd just like to see the reasoning behind it
documented:

http://lists.nongnu.org/archive/html/qemu-devel/2017-02/msg01640.html

To which I suggested that you please capture the above argument in a new
subsection in the docs file:

http://lists.nongnu.org/archive/html/qemu-devel/2017-02/msg01642.html

To which Igor replied that we should reference that documentation from
the exact spot in the code, where 40 decimal is added:

http://lists.nongnu.org/archive/html/qemu-devel/2017-02/msg01647.html

That's apparently our agreement on this. :)

>>
>> (3) Whether the VGIA named object should be a DWORD or a QWORD in ACPI.
>> Personally, I'm fine with either, but I see that DWORD is more
>> compatible with old OSes.
>>
>> What I really care about though is the new WRITE_POINTER command
>> structure. That should support 8 bytes as well.
>>
>> So this is exactly how Michael summarized it ultimately:
>> - use a DWORD for VGIA in ACPI,
>> - with a matching 4-byte wide ADD_POINTER command;
>> - write the address with a 4-byte wide WRITE_POINTER command into
>>  fw_cfg,
>> - *but* the WRITE_POINTER command struct should support 8-byte as well,
>> - and the fw_cfg file that is written (vmgenid_addr) should have size 8
>>  from the start. (The guest will simply write offsets 0..3 inclusive,
>>  in LE byte order.)
>>
> This is an easy change, WRITE_POINTER already supports 8-byte writes.
>>
>> (4) IIRC Igor asked if sizing the blob to 4KB would guarantee that the
>> GUID ends up in a cacheable page -- yes (see also Michael's followup).
>> This size guarantees that the GUID will have a full page to itself, and
>> its allocated from normal guest RAM, as Reserved memory (SeaBIOS, IIRC)
>> or EfiACPIMemoryNVS memory (OVMF).
>>
>> Note that the VMGENID spec says, in ACPI terms, "It must not be in
>> ranges reported as AddressRangeMemory or AddressRangeACPI", but that's
>> fine:
>>
>> ACPI speak           UEFI speak            suitable for VMGENID?
>> ------------------   --------------------- ---------------------
>> AddressRangeMemory   EfiConventionalMemory no
>> AddressRangeACPI     EfiACPIReclaimMemory  no
>> AddressRangeNVS      EfiACPIMemoryNVS      yes, used by OVMF
>> AddressRangeReserved EfiReservedMemoryType yes, used by SeaBIOS
>>
>>
>> (5) The race and fw_cfg callbacks.
>>
>> (a) Michael and Ben identified a problem where
>> - QEMU places the initial GUID in the "vmgenid" fw_cfg blob,
>> - the guest downloads it,
>> - the host admin changes the GUID via the monitor,
>> - and the guest writes the address to "vmgenid_addr" *only* after that.
>>
>> In other words, the monitor action occurs between the guest's processing
>> of the ALLOCATE and WRITE_POINTER linker/loader commands.
>>
>> (b) A similar problem is rebooting the guest.
>> - The guest starts up fine,
>> - the OS runs,
>> - the host admin changes the GUID via the monitor,
>> - and the guest sees the update fine.
>> - At this point the guest is rebooted (within the same QEMU instance).
>>
>> Since the fw_cfg blob is not rebuilt -- more precisely, the blob *is*
>> rebuilt, but it is then thrown away in acpi_build_update() --, the guest
>> will see a different GUID from the last value set on the monitor. (And
>> querying the monitor will actually return that last-set value, which is
>> no longer known by the guest.)
>>
>>
>> The suggestion for these problems was to add a callback to "one" of the
>> fw_cfg files in question.
>>
>> Let's investigate the writeable "vmgenid_addr" file first. Here the idea
>> is to rewrite the GUID in guest RAM as soon as the address is known from
>> the guest, regardless of what GUID the guest placed there originally,
>> through downloading the "vmgenid" blob.
>>
>> (i) We have no write callbacks at the moment. Before we removed the data
>> port based write support in QEMU 2.4, the write callback used to be
>> invoked when the end of the fw_cfg blob was reached. Since we intend to
>> pack several addresses in the same writeable fw_cfg blob down the road,
>> at different offsets, this method wouldn't work; the final offset in the
>> blob might not be reached at all. (Another thing that wouldn't work is
>> an 8-byte writeable fw_cfg blob, and a 4-byte WRITE_POINTER command that
>> only rewrites offsets 0..3 inclusve -- see (3) above.)
>>
>> So I don't think that a write callback is viable, at least with the
>> "many addresses at different offsets in the same blob" feature. Unless
>> we want to register different callbacks (or different callback
>> arguments) for different offsets. And then a large write could trigger a
>> series of callbacks, even, if it covers several special offsets. This
>> looks too complex to me.
>>
>> (ii) What we do have now is a select callback. Unfortunately, the guest
>> is not required to select the item and atomically rewrite the blob, in
>> the same fw_cfg DMA operation. The guest is allowed to do the selection
>> with the IO port method, and then write the blob separately with DMA.
>> (This is what OVMF does.) IOW, when we'd get the callback (for the
>> select), the address would not have been written yet.
>>
>> So I don't think that a select callback for the writeable "vmgenid_addr"
>> file is viable either.
>>
>> (iii) Let's see a select callback for the "vmgenid" blob itself! The
>> idea is, whenever the guest selects this blob, immediately refresh the
>> blob from the last set GUID. Then let the guest download the
>> just-refreshed blob.
>>
> I like this approach, and have it working.

Cool! :)

> I assume by “refresh the
> blob” you mean making a call to fw_cfg_modify_file().  If that is the
> case, I need to modify the function fw_cfg_modify_bytes_read() because
> of the following commented-out problem lines (656:657 in fw_cfg.c):
> 
>     /* return the old data to the function caller, avoid memory leak */
>     ptr = s->entries[arch][key].data;
>     s->entries[arch][key].data = data;
>     s->entries[arch][key].len = len;
> //    s->entries[arch][key].callback_opaque = NULL;
> //    s->entries[arch][key].allow_write = false;
> 
> As you can see, this function modifies not only a fw_cfg object’s data,
> but also its metadata, wiping out the callback parameter and forcing it
> to read-only.  I don’t know the history of this function, so want to
> tread lightly.  The second parameter doesn’t impact me, since this
> fw_cfg object is read-only, but the callback parameter one does.

It is justified to tread lightly :), because you really don't need to
call fw_cfg_modify_file().

Instead, in vmgenid_add_fw_cfg(), please do this:


and then, whenever necessary, you can poke the GUID right into
"vms->guid_blob" (i.e., the fw_cfg data), at the appropriate offset
(40). No reallocation / metadata changes are necessary.

However, this means that you have to extend vmgenid_post_load() *too*,
with the same fw_cfg blob update. (You don't have to migrate the blob,
it is a whole bunch of zeroes with a GUID in the middle that has to be
refreshed on the target host anyway.)

> 
>> The main observation here is that we can *fail* QMP commands. That's
>> good enough if we can tell *when* to fail them.
>>
>> Under this idea, we would fail the "set-vm-generation-id" QMP command
>> with "EAGAIN" if the quest had selected the "vmgenid" blob since machine
>> reset -- this can be tracked with a latch -- *but* the particular offset
>> range in the "vmgenid_addr" blob were still zero.
>>
>> This is exactly the window between the ALLOCATE and WRITE_POINTER
>> commands where updates from the monitor would be lost. So let's just
>> tell the management layer to try again later (like 1 second later). By
>> that time, any sane guest will have written the address into
>> "vmgenid_addr”.
>>
> This seems sane.  I’ll just return EAGAIN if vmgenid_addr is zero,
> meaning the guest is not fully synchronized yet so don’t accept input.

Right, it means that the guest is temporarily in a state where it cannot
accept a new GUID.

Regarding the error structure -- we have structured errors! --, please
don't use error_setg(), nor error_setg_errno(). "EAGAIN" above was just
an example. Both of the aforementioned functions format an
ERROR_CLASS_GENERIC_ERROR, which is not really useful for libvirt if
libvirt wants to treat this condition specially.

Instead, we have another error class that looks appropriate:
ERROR_CLASS_DEVICE_NOT_ACTIVE. If you grep the source for it, you'll
find two independent users, which is (IMO) license to use it for our
purposes as well.

    error_set(errp, ERROR_CLASS_DEVICE_NOT_ACTIVE,
              "<format-string>", arguments ...);

See also "@DeviceNotActive" under "@QapiErrorClass" in "qapi/common.json".

If you grep the libvirt source code for "DeviceNotActive", there are
several hits (with custom error handling) in "src/qemu/qemu_monitor_json.c".

> 
> One other more radical idea I had was to just remove the monitor GUID
> modification capability completely.

You are going to laugh, but this was actually what I was about to
suggest while writing the email you've just replied to :)

I could not figure out *why* the management layer would want to change
the VMGUID for an already running domain. Generate a random one at
startup, or at incoming migration (hence, on the command line!), which
also covers restart from snapshot I guess, is fully justified, but why
mess with it otherwise? Where does this requirement come from?

In the docs patch, you mention Hyper-V and Xen as other VMMs that
support vmgenid. I didn't look into Hyper-V, but googling Xen, it looked
like they had a similar management interface (for runtime). At that
point I said to myself, "okay, so this is for 'feature parity' for all I
know", and deleted like three paragraphs :)

>  The VM Generation ID needs to
> change only in the following situations:
> - new VM (command line processed)
> - start from snapshot (command line processed, VGIA restored from VmState.

(this is also "incoming migration")

> - VM recovered from backup (like a new VM, command line processed).
> - VM imported, copied or cloned (like a new VM, command line processed).
> - failed over from a disaster recovery environment.  Here I’m not sure.
> It’s possible I suppose that one would have a DR VM in hot standby, but
> would want to change its VM Generation ID in order to force Windows to
> re-synchronize Active Directory, crypto seeds etc.  Here the monitor
> might be necessary.

Well, I really don't know. I guess we can follow approach (iii), just to
be safe.

Thanks!
Laszlo
diff mbox

Patch

diff --git a/hw/acpi/vmgenid.c b/hw/acpi/vmgenid.c
index af8b35cebbe7..3860e9717e12 100644
--- a/hw/acpi/vmgenid.c
+++ b/hw/acpi/vmgenid.c
@@ -111,6 +111,8 @@  void vmgenid_add_fw_cfg(FWCfgState *s, GArray *guid)
     assert(obj);
     VmGenIdState *vms = VMGENID(obj);

+    vms->guid_blob = guid->data;
+
     /* Create a read-only fw_cfg file for GUID */
     fw_cfg_add_file(s, VMGENID_GUID_FW_CFG_FILE, guid->data,
                     VMGENID_FW_CFG_SIZE);