diff mbox series

[v3] hw/i386: place setup_data at fixed place in memory

Message ID 20220804230411.17720-1-Jason@zx2c4.com
State New

Commit Message

Jason A. Donenfeld Aug. 4, 2022, 11:04 p.m. UTC
The boot parameter header refers to setup_data at an absolute address,
and each setup_data refers to the next setup_data at an absolute address
too. Currently QEMU simply puts the setup_datas right after the kernel
image, and since the kernel image is loaded at prot_addr -- a fixed
address knowable to QEMU a priori -- the setup_data absolute address
winds up being just `prot_addr + a_fixed_offset_into_kernel_image`.

This mostly works fine, so long as the kernel image really is loaded at
prot_addr. However, OVMF doesn't load the kernel at prot_addr, and
generally EFI doesn't give a good way of predicting where it's going to
load the kernel. So when it loads it at some address != prot_addr, the
absolute addresses in setup_data now point somewhere bogus, causing
crashes when the EFI stub tries to follow the next link.

Fix this by placing setup_data at some fixed place in memory, not as
part of the kernel image, and then pointing the setup_data absolute
address to that fixed place in memory. This way, even if OVMF or other
chains relocate the kernel image, the boot parameter still points to the
correct absolute address.

For this, use a part of the hardware mapped area that nothing else
uses.

Fixes: 3cbeb52467 ("hw/i386: add device tree support")
Reported-by: Xiaoyao Li <xiaoyao.li@intel.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Richard Henderson <richard.henderson@linaro.org>
Cc: Peter Maydell <peter.maydell@linaro.org>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Daniel P. Berrangé <berrange@redhat.com>
Cc: Gerd Hoffmann <kraxel@redhat.com>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: Laszlo Ersek <lersek@redhat.com>
Cc: linux-efi@vger.kernel.org
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
---
 hw/i386/x86.c | 39 +++++++++++++++++++++------------------
 1 file changed, 21 insertions(+), 18 deletions(-)
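
To make the failure mode in the commit message concrete, here is a minimal C sketch, not QEMU code. It assumes a simplified, host-endian stand-in for struct setup_data (the hypothetical setup_data_hdr), and the helpers add_item() and walk() are invented for illustration. Items are chained through absolute addresses computed from an assumed base, so the chain can only be followed if the blob really ends up at that base.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Simplified stand-in for the kernel's struct setup_data header. */
struct setup_data_hdr {
    uint64_t next;  /* absolute guest address of the next node, 0 = end */
    uint32_t type;
    uint32_t len;   /* payload bytes following the header */
};

/*
 * Append one zero-filled item to *buf and prepend it to the list.
 * 'base' is the guest address the blob is assumed to be loaded at.
 * Returns the new list head (an absolute address).
 */
static uint64_t add_item(uint8_t **buf, size_t *total, uint64_t base,
                         uint64_t head, uint32_t type, uint32_t len)
{
    struct setup_data_hdr hdr = { .next = head, .type = type, .len = len };
    size_t item_len = sizeof(hdr) + len;
    uint8_t *p = realloc(*buf, *total + item_len);

    assert(p != NULL);
    *buf = p;
    memcpy(*buf + *total, &hdr, sizeof(hdr));
    memset(*buf + *total + sizeof(hdr), 0, len);

    uint64_t addr = base + *total;
    *total += item_len;
    return addr;
}

/*
 * Follow the list as a guest would, given the address the blob was
 * actually loaded at. Returns the number of nodes reached, or -1 if a
 * link points outside the blob.
 */
static int walk(const uint8_t *buf, size_t total,
                uint64_t load_addr, uint64_t head)
{
    int n = 0;

    while (head != 0) {
        struct setup_data_hdr hdr;

        if (head < load_addr || head - load_addr > total ||
            total - (head - load_addr) < sizeof(hdr)) {
            return -1;
        }
        memcpy(&hdr, buf + (head - load_addr), sizeof(hdr));
        head = hdr.next;
        n++;
    }
    return n;
}
```

Building two items against base 0x100000 and walking them with load_addr 0x100000 reaches both nodes; walking the same blob as if firmware had placed it at any other address fails on the first link, which is the situation described above for OVMF.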

Comments

Paolo Bonzini Aug. 5, 2022, 8:10 a.m. UTC | #1
On 8/5/22 01:04, Jason A. Donenfeld wrote:
> +    /* Nothing else uses this part of the hardware mapped region */
> +    setup_data_base = 0xfffff - 0x1000;

Isn't this where the BIOS lives?  I don't think this works.

Does it work to place setup_data at the end of the cmdline file instead 
of having it at the end of the kernel file?  This way the first item 
will be at 0x20000 + cmdline_size.

Paolo
Ard Biesheuvel Aug. 5, 2022, 11:08 a.m. UTC | #2
On Fri, 5 Aug 2022 at 10:10, Paolo Bonzini <pbonzini@redhat.com> wrote:
>
> On 8/5/22 01:04, Jason A. Donenfeld wrote:
> > +    /* Nothing else uses this part of the hardware mapped region */
> > +    setup_data_base = 0xfffff - 0x1000;
>
> Isn't this where the BIOS lives?  I don't think this works.
>
> Does it work to place setup_data at the end of the cmdline file instead
> of having it at the end of the kernel file?  This way the first item
> will be at 0x20000 + cmdline_size.
>

Does QEMU always allocate the command line statically like that?
AFAIK, OVMF never accesses that memory to read the command line, it
uses fw_cfg to copy it into a buffer it allocates itself. And I guess
that implies that this region could be clobbered by OVMF unless it is
told to preserve it.
Jason A. Donenfeld Aug. 5, 2022, 12:47 p.m. UTC | #3
Hi Paolo,

On Fri, Aug 05, 2022 at 10:10:02AM +0200, Paolo Bonzini wrote:
> On 8/5/22 01:04, Jason A. Donenfeld wrote:
> > +    /* Nothing else uses this part of the hardware mapped region */
> > +    setup_data_base = 0xfffff - 0x1000;
> 
> Isn't this where the BIOS lives?  I don't think this works.

That's the segment dedicated to ROM and hardware mapped addresses. So
that's a place to put ROM material. No actual software will use it.

Jason
Laszlo Ersek Aug. 5, 2022, 1:34 p.m. UTC | #4
On 08/05/22 14:47, Jason A. Donenfeld wrote:
> Hi Paolo,
> 
> On Fri, Aug 05, 2022 at 10:10:02AM +0200, Paolo Bonzini wrote:
>> On 8/5/22 01:04, Jason A. Donenfeld wrote:
>>> +    /* Nothing else uses this part of the hardware mapped region */
>>> +    setup_data_base = 0xfffff - 0x1000;
>>
>> Isn't this where the BIOS lives?  I don't think this works.
> 
> That's the segment dedicated to ROM and hardware mapped addresses. So
> that's a place to put ROM material. No actual software will use it.

... accordingly (I think), when the guest tries to read it, it will see the ROM MemoryRegion that QEMU places there, not RAM contents.

"info mtree" QEMU monitor command output (excerpt), while OVMF is in the Boot Device Selection phase (well, I left it waiting in the Setup TUI):

address-space: memory
  0000000000000000-ffffffffffffffff (prio 0, i/o): system
    0000000000000000-000000007fffffff (prio 0, ram): alias ram-below-4g @pc.ram 0000000000000000-000000007fffffff
    0000000000000000-ffffffffffffffff (prio -1, i/o): pci
      00000000000a0000-00000000000affff (prio 2, ram): alias vga.chain4 @vga.vram 0000000000000000-000000000000ffff
      00000000000a0000-00000000000bffff (prio 1, i/o): vga-lowmem
      00000000000c0000-00000000000dffff (prio 1, rom): pc.rom
      00000000000e0000-00000000000fffff (prio 1, rom): isa-bios

flat view ("info mtree -f"):

FlatView #1
 AS "memory", root: system
 AS "cpu-memory-0", root: system
 AS "cpu-memory-1", root: system
 AS "cpu-memory-2", root: system
 AS "cpu-memory-3", root: system
 AS "mch", root: bus master container
 AS "ICH9-LPC", root: bus master container
 AS "ich9-ahci", root: bus master container
 AS "ICH9-SMB", root: bus master container
 AS "pcie-root-port", root: bus master container
 AS "pcie-root-port", root: bus master container
 AS "pcie-root-port", root: bus master container
 AS "pcie-root-port", root: bus master container
 AS "pcie-root-port", root: bus master container
 AS "qemu-xhci", root: bus master container
 AS "virtio-scsi-pci", root: bus master container
 AS "virtio-serial-pci", root: bus master container
 AS "virtio-net-pci", root: bus master container
 AS "VGA", root: bus master container
 AS "virtio-balloon-pci", root: bus master container
 AS "virtio-rng-pci", root: bus master container
 Root memory region: system
  0000000000000000-000000000002ffff (prio 0, ram): pc.ram KVM
  0000000000030000-000000000004ffff (prio 1, i/o): smbase-blackhole
  0000000000050000-000000000009ffff (prio 0, ram): pc.ram @0000000000050000 KVM
  00000000000a0000-00000000000affff (prio 1, ram): vga.vram KVM
  00000000000b0000-00000000000bffff (prio 1, i/o): vga-lowmem @0000000000010000
  00000000000c0000-00000000000c3fff (prio 0, rom): pc.ram @00000000000c0000 KVM
  00000000000c4000-00000000000dffff (prio 1, rom): pc.rom @0000000000004000 KVM
  00000000000e0000-00000000000fffff (prio 1, rom): isa-bios KVM


Laszlo
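
The flat view above can be turned into a small lookup to check where the patch's setup_data_base lands. This is only an illustrative C sketch: the table is copied by hand from the "info mtree -f" excerpt, and the struct region type and the helpers find_region() and range_is_ram() are hypothetical names, not QEMU APIs.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct region {
    uint64_t start, end;  /* inclusive bounds, as printed by info mtree */
    bool is_ram;          /* true only for entries listed as "ram" */
    const char *name;
};

/* Low 1 MiB of the flat view quoted above. */
static const struct region flat_view[] = {
    { 0x00000, 0x2ffff, true,  "pc.ram" },
    { 0x30000, 0x4ffff, false, "smbase-blackhole" },
    { 0x50000, 0x9ffff, true,  "pc.ram" },
    { 0xa0000, 0xaffff, true,  "vga.vram" },
    { 0xb0000, 0xbffff, false, "vga-lowmem" },
    { 0xc0000, 0xc3fff, false, "pc.ram (mapped as rom)" },
    { 0xc4000, 0xdffff, false, "pc.rom" },
    { 0xe0000, 0xfffff, false, "isa-bios" },
};

/* Return the region covering addr, or NULL if none is listed. */
static const struct region *find_region(uint64_t addr)
{
    for (size_t i = 0; i < sizeof(flat_view) / sizeof(flat_view[0]); i++) {
        if (addr >= flat_view[i].start && addr <= flat_view[i].end) {
            return &flat_view[i];
        }
    }
    return NULL;
}

/* True if every byte of [base, base + len) is backed by a "ram" entry. */
static bool range_is_ram(uint64_t base, uint64_t len)
{
    uint64_t addr = base, last = base + len - 1;

    while (addr <= last) {
        const struct region *r = find_region(addr);

        if (r == NULL || !r->is_ram) {
            return false;
        }
        if (r->end >= last) {
            break;
        }
        addr = r->end + 1;
    }
    return true;
}
```

With these tables, 0xfffff - 0x1000 evaluates to 0xfefff, which find_region() places inside isa-bios (0xe0000-0xfffff), so range_is_ram(0xfefff, 0x1000) is false. For comparison, a range starting at 0x20000 falls in pc.ram.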
Paolo Bonzini Aug. 5, 2022, 5:29 p.m. UTC | #5
On 8/5/22 13:08, Ard Biesheuvel wrote:
>>
>> Does it work to place setup_data at the end of the cmdline file instead
>> of having it at the end of the kernel file?  This way the first item
>> will be at 0x20000 + cmdline_size.
>>
> Does QEMU always allocate the command line statically like that?
> AFAIK, OVMF never accesses that memory to read the command line, it
> uses fw_cfg to copy it into a buffer it allocates itself. And I guess
> that implies that this region could be clobbered by OVMF unless it is
> told to preserve it.

No it's not. :(  It also goes to gBS->AllocatePages in the end.

At this point it seems to me that without extra changes the whole 
setup_data concept is dead on arrival for OVMF.  In principle there's no 
reason why the individual setup_data items couldn't include interior 
pointers, meaning that the setup_data _has_ to be at the address 
provided in fw_cfg by QEMU.

One way to "fix" it would be for OVMF to overwrite the pointer to the 
head of the list, so that the kernel ignores the setup data provided by 
QEMU. Another way would be to put it in the command line fw_cfg blob and 
teach OVMF to use a fixed address for the command line.  Both are ugly, 
and both are also broken for new QEMU / old OVMF.

In any case, I don't think this should be fixed so close to the release. 
  We have two possibilities:

1) if we believe "build setup_data in QEMU" is a feasible design that 
only needs more yak shaving, we can keep the code in, but disabled by 
default, and sort it out in 7.2.

2) if we go for an alternative design, it needs to be reverted.  For 
example the randomness could be in _another_ fw_cfg file, and the 
linuxboot DMA can patch it in the setup_data.


With (2) the OVMF breakage would be limited to -dtb, which more or less 
nobody cares about, and we can just look the other way.

Paolo
Ard Biesheuvel Aug. 5, 2022, 5:56 p.m. UTC | #6
On Fri, 5 Aug 2022 at 19:29, Paolo Bonzini <pbonzini@redhat.com> wrote:
>
> On 8/5/22 13:08, Ard Biesheuvel wrote:
> >>
> >> Does it work to place setup_data at the end of the cmdline file instead
> >> of having it at the end of the kernel file?  This way the first item
> >> will be at 0x20000 + cmdline_size.
> >>
> > Does QEMU always allocate the command line statically like that?
> > AFAIK, OVMF never accesses that memory to read the command line, it
> > uses fw_cfg to copy it into a buffer it allocates itself. And I guess
> > that implies that this region could be clobbered by OVMF unless it is
> > told to preserve it.
>
> No it's not. :(  It also goes to gBS->AllocatePages in the end.
>
> At this point it seems to me that without extra changes the whole
> setup_data concept is dead on arrival for OVMF.  In principle there's no
> reason why the individual setup_data items couldn't include interior
> pointers, meaning that the setup_data _has_ to be at the address
> provided in fw_cfg by QEMU.
>

AIUI, the setup_data nodes are appended at the end, so they are not
covered by the setup_data fw_cfg file but the kernel one.

> One way to "fix" it would be for OVMF to overwrite the pointer to the
> head of the list, so that the kernel ignores the setup data provided by
> QEMU. Another way would be to put it in the command line fw_cfg blob and
> teach OVMF to use a fixed address for the command line.  Both are ugly,
> and both are also broken for new QEMU / old OVMF.
>

This is the 'pure EFI' boot path in OVMF, which means that the
firmware does not rely on definitions of struct bootparams or struct
setup_header at all. Introducing that dependency just for this is
something I'd really prefer to avoid.

> In any case, I don't think this should be fixed so close to the release.
>   We have two possibilities:
>
> 1) if we believe "build setup_data in QEMU" is a feasible design that
> only needs more yak shaving, we can keep the code in, but disabled by
> default, and sort it out in 7.2.
>

As I argued before, conflating the 'file' representation with the
'memory' representation like this is fundamentally flawed. fw_cfg
happily DMA's those files anywhere you like, so their contents should
not be position dependent like this.

So Jason's fix gets us halfway there, although we now pass information
to the kernel that is not covered by signatures or measurements, where
the setup_data pointer itself is. This means you can replace a single
SETUP_RNG_SEED node in memory with a whole set of SETUP_xxx nodes that
might be rigged to manipulate the boot in a way that measured boot
won't detect.

This is perhaps a bit of a stretch, and arguably only a problem if
secure or measured boot are enabled to begin with, in which case we
could impose additional policy on the use of setup_data. But still ...

> 2) if we go for an alternative design, it needs to be reverted.  For
> example the randomness could be in _another_ fw_cfg file, and the
> linuxboot DMA can patch it in the setup_data.
>
>
> With (2) the OVMF breakage would be limited to -dtb, which more or less
> nobody cares about, and we can just look the other way.
>
> Paolo
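
One way to picture the "file" versus "memory" distinction raised above is a blob whose links are stored as offsets and only turned into absolute addresses by whoever finally places it. The following C sketch is purely hypothetical and is not something proposed in this thread: node_hdr, REL_END and relocate_list() are invented names, and the layout is a simplified host-endian stand-in for struct setup_data.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Marks the end of the list while links are still blob-relative. */
#define REL_END UINT64_MAX

struct node_hdr {
    uint64_t next;  /* file form: offset of next node; memory form: address */
    uint32_t type;
    uint32_t len;
};

/*
 * Rewrite blob-relative 'next' offsets into absolute addresses once the
 * final load address is known. Returns the absolute address of the first
 * node, or 0 if the list is empty or malformed.
 */
static uint64_t relocate_list(uint8_t *blob, size_t total,
                              uint64_t head_off, uint64_t load_addr)
{
    uint64_t off = head_off;
    size_t visited = 0;

    if (head_off == REL_END) {
        return 0;
    }
    while (off != REL_END) {
        struct node_hdr hdr;

        /* Reject links that leave the blob, and bound the walk. */
        if (off > total || total - off < sizeof(hdr) ||
            ++visited > total / sizeof(hdr)) {
            return 0;
        }
        memcpy(&hdr, blob + off, sizeof(hdr));

        uint64_t next_off = hdr.next;
        hdr.next = (next_off == REL_END) ? 0 : load_addr + next_off;
        memcpy(blob + off, &hdr, sizeof(hdr));
        off = next_off;
    }
    return load_addr + head_off;
}
```

The design point is that the blob itself carries no assumption about where it will live; the party that allocates the memory performs the fixup and writes the returned head address into the boot parameters. It does not address the measurement concern raised above.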
Michael S. Tsirkin Aug. 9, 2022, 9:17 a.m. UTC | #7
On Fri, Aug 05, 2022 at 07:29:29PM +0200, Paolo Bonzini wrote:
> On 8/5/22 13:08, Ard Biesheuvel wrote:
> > > 
> > > Does it work to place setup_data at the end of the cmdline file instead
> > > of having it at the end of the kernel file?  This way the first item
> > > will be at 0x20000 + cmdline_size.
> > > 
> > Does QEMU always allocate the command line statically like that?
> > AFAIK, OVMF never accesses that memory to read the command line, it
> > uses fw_cfg to copy it into a buffer it allocates itself. And I guess
> > that implies that this region could be clobbered by OVMF unless it is
> > told to preserve it.
> 
> No it's not. :(  It also goes to gBS->AllocatePages in the end.
> 
> At this point it seems to me that without extra changes the whole setup_data
> concept is dead on arrival for OVMF.  In principle there's no reason why the
> individual setup_data items couldn't include interior pointers, meaning that
> the setup_data _has_ to be at the address provided in fw_cfg by QEMU.
> 
> One way to "fix" it would be for OVMF to overwrite the pointer to the head
> of the list, so that the kernel ignores the setup data provided by QEMU.
> Another way would be to put it in the command line fw_cfg blob and teach
> OVMF to use a fixed address for the command line.  Both are ugly, and both
> are also broken for new QEMU / old OVMF.
> 
> In any case, I don't think this should be fixed so close to the release.  We
> have two possibilities:
> 
> 1) if we believe "build setup_data in QEMU" is a feasible design that only
> needs more yak shaving, we can keep the code in, but disabled by default,
> and sort it out in 7.2.
> 
> 2) if we go for an alternative design, it needs to be reverted.  For example
> the randomness could be in _another_ fw_cfg file, and the linuxboot DMA can
> patch it in the setup_data.
> 
> 
> With (2) the OVMF breakage would be limited to -dtb, which more or less
> nobody cares about, and we can just look the other way.
> 
> Paolo


So IIUC you retract your pc: add property for Linux setup_data random
number seed then? It's neither of the two options above.
Jason A. Donenfeld Aug. 9, 2022, 12:17 p.m. UTC | #8
Hey Paolo,

On Fri, Aug 05, 2022 at 02:47:27PM +0200, Jason A. Donenfeld wrote:
> Hi Paolo,
> 
> On Fri, Aug 05, 2022 at 10:10:02AM +0200, Paolo Bonzini wrote:
> > On 8/5/22 01:04, Jason A. Donenfeld wrote:
> > > +    /* Nothing else uses this part of the hardware mapped region */
> > > +    setup_data_base = 0xfffff - 0x1000;
> > 
> > Isn't this where the BIOS lives?  I don't think this works.
> 
> That's the segment dedicated to ROM and hardware mapped addresses. So
> that's a place to put ROM material. No actual software will use it.
> 
> Jason

Unless I've misread the thread, I don't think there are any remaining
objections, right? Can we try merging this and seeing if it fixes the
issue for good?

Jason
Michael S. Tsirkin Aug. 9, 2022, 2:07 p.m. UTC | #9
On Tue, Aug 09, 2022 at 02:17:23PM +0200, Jason A. Donenfeld wrote:
> Hey Paolo,
> 
> On Fri, Aug 05, 2022 at 02:47:27PM +0200, Jason A. Donenfeld wrote:
> > Hi Paolo,
> > 
> > On Fri, Aug 05, 2022 at 10:10:02AM +0200, Paolo Bonzini wrote:
> > > On 8/5/22 01:04, Jason A. Donenfeld wrote:
> > > > +    /* Nothing else uses this part of the hardware mapped region */
> > > > +    setup_data_base = 0xfffff - 0x1000;
> > > 
> > > Isn't this where the BIOS lives?  I don't think this works.
> > 
> > That's the segment dedicated to ROM and hardware mapped addresses. So
> > that's a place to put ROM material. No actual software will use it.
> > 
> > Jason
> 
> Unless I've misread the thread, I don't think there are any remaining
> objections, right? Can we try merging this and seeing if it fixes the
> issue for good?
> 
> Jason

Laszlo commented here:
https://lore.kernel.org/r/fa0601e4-acf5-0ce8-9277-4d90d046b53e%40redhat.com
Daniel P. Berrangé Aug. 9, 2022, 2:15 p.m. UTC | #10
On Tue, Aug 09, 2022 at 10:07:44AM -0400, Michael S. Tsirkin wrote:
> On Tue, Aug 09, 2022 at 02:17:23PM +0200, Jason A. Donenfeld wrote:
> > Hey Paolo,
> > 
> > On Fri, Aug 05, 2022 at 02:47:27PM +0200, Jason A. Donenfeld wrote:
> > > Hi Paolo,
> > > 
> > > On Fri, Aug 05, 2022 at 10:10:02AM +0200, Paolo Bonzini wrote:
> > > > On 8/5/22 01:04, Jason A. Donenfeld wrote:
> > > > > +    /* Nothing else uses this part of the hardware mapped region */
> > > > > +    setup_data_base = 0xfffff - 0x1000;
> > > > 
> > > > Isn't this where the BIOS lives?  I don't think this works.
> > > 
> > > That's the segment dedicated to ROM and hardware mapped addresses. So
> > > that's a place to put ROM material. No actual software will use it.
> > > 
> > > Jason
> > 
> > Unless I've misread the thread, I don't think there are any remaining
> > objections, right? Can we try merging this and seeing if it fixes the
> > issue for good?
> > 
> > Jason
> 
> Laszlo commented here:
> https://lore.kernel.org/r/fa0601e4-acf5-0ce8-9277-4d90d046b53e%40redhat.com

It is 7.1.0 rc2 date today, which leaves ideally only one rc remaining
before GA release.

The discussion still taking place in this thread does not fill me with
confidence that we're going to have a *well tested* solution before GA.
Even if we agree on a patch, are we really going to have confidence
in it being reliable if we've only got a week of testing ?

IMHO we're at the point where we should just disable the RNG feature
for 7.1.0, and give ourselves time to come up with a solution in 7.2.0
that can be properly tested without the time pressure of release deadlines.


With regards,
Daniel
Paolo Bonzini Aug. 9, 2022, 2:19 p.m. UTC | #11
On 8/9/22 11:17, Michael S. Tsirkin wrote:
>> 1) if we believe "build setup_data in QEMU" is a feasible design that only
>> needs more yak shaving, we can keep the code in, but disabled by default,
>> and sort it out in 7.2.
>>
>> 2) if we go for an alternative design, it needs to be reverted.  For example
>> the randomness could be in _another_ fw_cfg file, and the linuxboot DMA can
>> patch it in the setup_data.
>>
>> With (2) the OVMF breakage would be limited to -dtb, which more or less
>> nobody cares about, and we can just look the other way.
> 
> So IIUC you retract your pc: add property for Linux setup_data random
> number seed then? It's neither of the two options above.

That one would be a base for (1).

Another choice (3) is to put a pointer to the first setup_data in a new 
fw_cfg entry, and let the option ROMs place it in the header.

In any case, as Laszlo said this [PATCH v3] does not work because 
0xf0000 is mapped as ROM (and if it worked, it would have the same 
problem as the first 640K).

Paolo

Patch

diff --git a/hw/i386/x86.c b/hw/i386/x86.c
index 050eedc0c8..3affef3277 100644
--- a/hw/i386/x86.c
+++ b/hw/i386/x86.c
@@ -773,10 +773,10 @@  void x86_load_linux(X86MachineState *x86ms,
     bool linuxboot_dma_enabled = X86_MACHINE_GET_CLASS(x86ms)->fwcfg_dma_enabled;
     uint16_t protocol;
     int setup_size, kernel_size, cmdline_size;
-    int dtb_size, setup_data_offset;
+    int dtb_size, setup_data_item_len, setup_data_total_len = 0;
     uint32_t initrd_max;
-    uint8_t header[8192], *setup, *kernel;
-    hwaddr real_addr, prot_addr, cmdline_addr, initrd_addr = 0, first_setup_data = 0;
+    uint8_t header[8192], *setup, *kernel, *setup_datas = NULL;
+    hwaddr real_addr, prot_addr, cmdline_addr, initrd_addr = 0, first_setup_data = 0, setup_data_base;
     FILE *f;
     char *vmode;
     MachineState *machine = MACHINE(x86ms);
@@ -899,6 +899,8 @@  void x86_load_linux(X86MachineState *x86ms,
         cmdline_addr = 0x20000;
         prot_addr    = 0x100000;
     }
+    /* Nothing else uses this part of the hardware mapped region */
+    setup_data_base = 0xfffff - 0x1000;
 
     /* highest address for loading the initrd */
     if (protocol >= 0x20c &&
@@ -1062,34 +1064,35 @@  void x86_load_linux(X86MachineState *x86ms,
             exit(1);
         }
 
-        setup_data_offset = QEMU_ALIGN_UP(kernel_size, 16);
-        kernel_size = setup_data_offset + sizeof(struct setup_data) + dtb_size;
-        kernel = g_realloc(kernel, kernel_size);
-
-
-        setup_data = (struct setup_data *)(kernel + setup_data_offset);
+        setup_data_item_len = sizeof(struct setup_data) + dtb_size;
+        setup_datas = g_realloc(setup_datas, setup_data_total_len + setup_data_item_len);
+        setup_data = (struct setup_data *)(setup_datas + setup_data_total_len);
         setup_data->next = cpu_to_le64(first_setup_data);
-        first_setup_data = prot_addr + setup_data_offset;
+        first_setup_data = setup_data_base + setup_data_total_len;
+        setup_data_total_len += setup_data_item_len;
         setup_data->type = cpu_to_le32(SETUP_DTB);
         setup_data->len = cpu_to_le32(dtb_size);
-
         load_image_size(dtb_filename, setup_data->data, dtb_size);
     }
 
     if (!legacy_no_rng_seed) {
-        setup_data_offset = QEMU_ALIGN_UP(kernel_size, 16);
-        kernel_size = setup_data_offset + sizeof(struct setup_data) + RNG_SEED_LENGTH;
-        kernel = g_realloc(kernel, kernel_size);
-        setup_data = (struct setup_data *)(kernel + setup_data_offset);
+        setup_data_item_len = sizeof(struct setup_data) + RNG_SEED_LENGTH;
+        setup_datas = g_realloc(setup_datas, setup_data_total_len + setup_data_item_len);
+        setup_data = (struct setup_data *)(setup_datas + setup_data_total_len);
         setup_data->next = cpu_to_le64(first_setup_data);
-        first_setup_data = prot_addr + setup_data_offset;
+        first_setup_data = setup_data_base + setup_data_total_len;
+        setup_data_total_len += setup_data_item_len;
         setup_data->type = cpu_to_le32(SETUP_RNG_SEED);
         setup_data->len = cpu_to_le32(RNG_SEED_LENGTH);
         qemu_guest_getrandom_nofail(setup_data->data, RNG_SEED_LENGTH);
     }
 
-    /* Offset 0x250 is a pointer to the first setup_data link. */
-    stq_p(header + 0x250, first_setup_data);
+    if (first_setup_data && !sev_enabled()) {
+            /* Offset 0x250 is a pointer to the first setup_data link. */
+            stq_p(header + 0x250, first_setup_data);
+            rom_add_blob("setup_data", setup_datas, setup_data_total_len, setup_data_total_len,
+                         setup_data_base, NULL, NULL, NULL, NULL, false);
+    }
 
     /*
      * If we're starting an encrypted VM, it will be OVMF based, which uses the