diff mbox

[RFC,v3,3/3] fw/pci: Allocate IGD stolen memory

Message ID 20160213002318.18456.48603.stgit@gimli.home
State New
Headers show

Commit Message

Alex Williamson Feb. 13, 2016, 12:23 a.m. UTC
Intel IGD makes use of memory allocated and marked reserved by the
BIOS as a stolen memory range.  For the most part, guest drivers don't
make use of this, but our achilles heel is the vBIOS.  The vBIOS
programs the device to use the host stolen memory range and it's used
in the pre-boot environment.  Generally the guest won't have access to
the host stolen memory area, so these accesses either land in VM
memory or unassigned space and generate IOMMU faults.  By allocating
this range in SeaBIOS and programming it into the device, QEMU (via
vfio) can make sure this guest allocated stolen memory range is used
instead.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
---
 src/fw/pciinit.c |   13 ++++++++++++-
 1 file changed, 12 insertions(+), 1 deletion(-)

Comments

Kevin O'Connor Feb. 13, 2016, 2:49 a.m. UTC | #1
On Fri, Feb 12, 2016 at 05:23:18PM -0700, Alex Williamson wrote:
> Intel IGD makes use of memory allocated and marked reserved by the
> BIOS as a stolen memory range.  For the most part, guest drivers don't
> make use of this, but our achilles heel is the vBIOS.  The vBIOS
> programs the device to use the host stolen memory range and it's used
> in the pre-boot environment.  Generally the guest won't have access to
> the host stolen memory area, so these accesses either land in VM
> memory or unassigned space and generate IOMMU faults.  By allocating
> this range in SeaBIOS and programming it into the device, QEMU (via
> vfio) can make sure this guest allocated stolen memory range is used
> instead.

What does "vBIOS" mean in this context?  Is it the video device option
rom or something else?

> 
> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> ---
>  src/fw/pciinit.c |   13 ++++++++++++-
>  1 file changed, 12 insertions(+), 1 deletion(-)
> 
> diff --git a/src/fw/pciinit.c b/src/fw/pciinit.c
> index 92170d5..c1ad5d4 100644
> --- a/src/fw/pciinit.c
> +++ b/src/fw/pciinit.c
> @@ -260,7 +260,7 @@ static void ich9_smbus_setup(struct pci_device *dev, void *arg)
>  static void intel_igd_opregion_setup(struct pci_device *dev, void *arg)
>  {
>      struct romfile_s *file = romfile_find("etc/igd-opregion");
> -    void *opregion;
> +    void *opregion, *bdsm;
>      u16 bdf = dev->bdf;
>  
>      if (!file || !file->size)
> @@ -281,6 +281,17 @@ static void intel_igd_opregion_setup(struct pci_device *dev, void *arg)
>  
>      dprintf(1, "Intel IGD OpRegion enabled on %02x:%02x.%x\n",
>              pci_bdf_to_bus(bdf), pci_bdf_to_dev(bdf), pci_bdf_to_fn(bdf));
> +
> +    bdsm = memalign_high(1024 * 1024, 1024 * 1024);
> +    if (!bdsm) {
> +        warn_noalloc();
> +        return;
> +    }

The "high" memory pool is not a good fit for such a large allocation.
For so much space, I'd use memalign_tmphigh() followed by
e820_add(addr, size, E820_RESERVED).

> +    pci_config_writel(bdf, 0x5C, cpu_to_le32((u32)bdsm));
> +
> +    dprintf(1, "Allocated 1MB reserved memory for Intel IGD stolen memory at "
> +            "0x%08x\n", (u32)bdsm);
>  }

Does this make sense to do unconditionally in the firmware whenever
the device is present?  The SeaBIOS release schedule is not in sync
with QEMU, so changes in the firmware tend to take longer to deploy if
something needs to be done differently in the future.

What happens if this register is not set?  Does anything go wrong if
register 0xFC is set, but 0x5C is not (eg, due to allocation failure).

Thanks.
-Kevin
Alex Williamson Feb. 13, 2016, 3:12 p.m. UTC | #2
Hi Kevin,

On Fri, 12 Feb 2016 21:49:04 -0500
"Kevin O'Connor" <kevin@koconnor.net> wrote:

> On Fri, Feb 12, 2016 at 05:23:18PM -0700, Alex Williamson wrote:
> > Intel IGD makes use of memory allocated and marked reserved by the
> > BIOS as a stolen memory range.  For the most part, guest drivers don't
> > make use of this, but our achilles heel is the vBIOS.  The vBIOS
> > programs the device to use the host stolen memory range and it's used
> > in the pre-boot environment.  Generally the guest won't have access to
> > the host stolen memory area, so these accesses either land in VM
> > memory or unassigned space and generate IOMMU faults.  By allocating
> > this range in SeaBIOS and programming it into the device, QEMU (via
> > vfio) can make sure this guest allocated stolen memory range is used
> > instead.  
> 
> What does "vBIOS" mean in this context?  Is it the video device option
> rom or something else?

vBIOS = video BIOS, you're correct, it's just the video device option
ROM.
 
> > 
> > Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> > ---
> >  src/fw/pciinit.c |   13 ++++++++++++-
> >  1 file changed, 12 insertions(+), 1 deletion(-)
> > 
> > diff --git a/src/fw/pciinit.c b/src/fw/pciinit.c
> > index 92170d5..c1ad5d4 100644
> > --- a/src/fw/pciinit.c
> > +++ b/src/fw/pciinit.c
> > @@ -260,7 +260,7 @@ static void ich9_smbus_setup(struct pci_device *dev, void *arg)
> >  static void intel_igd_opregion_setup(struct pci_device *dev, void *arg)
> >  {
> >      struct romfile_s *file = romfile_find("etc/igd-opregion");
> > -    void *opregion;
> > +    void *opregion, *bdsm;
> >      u16 bdf = dev->bdf;
> >  
> >      if (!file || !file->size)
> > @@ -281,6 +281,17 @@ static void intel_igd_opregion_setup(struct pci_device *dev, void *arg)
> >  
> >      dprintf(1, "Intel IGD OpRegion enabled on %02x:%02x.%x\n",
> >              pci_bdf_to_bus(bdf), pci_bdf_to_dev(bdf), pci_bdf_to_fn(bdf));
> > +
> > +    bdsm = memalign_high(1024 * 1024, 1024 * 1024);
> > +    if (!bdsm) {
> > +        warn_noalloc();
> > +        return;
> > +    }  
> 
> The "high" memory pool is not a good fit for such a large allocation.
> For so much space, I'd use memalign_tmphigh() followed by
> e820_add(addr, size, E820_RESERVED).

Ok, as you saw on IRC I was looking for alternatives, I can easily
switch to this.
 
> > +    pci_config_writel(bdf, 0x5C, cpu_to_le32((u32)bdsm));
> > +
> > +    dprintf(1, "Allocated 1MB reserved memory for Intel IGD stolen memory at "
> > +            "0x%08x\n", (u32)bdsm);
> >  }  
> 
> Does this make sense to do unconditionally in the firmware whenever
> the device is present?  The SeaBIOS release schedule is not in sync
> with QEMU, so changes in the firmware tend to take longer to deploy if
> something needs to be done differently in the future.
> 
> What happens if this register is not set?  Does anything go wrong if
> register 0xFC is set, but 0x5C is not (eg, due to allocation failure).

It's not entirely unconditional, it's tagged onto the end of the
OpRegion setup, so we know that we're at least dealing with a QEMU that
has that support.  We could link it to the PCI ROM BAR for IGD being
present since this is necessary to support the code in that ROM, but
that doesn't give us any idea if the QEMU support is there.  Maybe we
could do both, OpRegion support and ROM BAR present.

The write to 0x5C is used by QEMU code that traps writes to the device
I/O port BAR and replaces the host stolen memory address with the new
guest address generated here.  0x5C is initialized to 0x0 by kernel
vfio code, so we can detect whether it has been written.  If not
written, QEMU has no place to redirect to for stolen memory and it will
either overlap VM memory or an unassigned area.  The former may corrupt
VM memory, the latter throws host errors.  We could in QEMU halt with a
hardware error if 0x5C hasn't been programmed.

One alternative I was considering was to simply use the unused BAR5
slot on IGD to expose a fake 1MB, 32bit MMIO BAR.  SeaBIOS would program
this normally, QEMU would back it with its own allocation and we could
easily find the address of it when we need it.  The downside being that
the PCI BAR can be disabled and remapped, but for the small window
where we need it, I'm not sure that matters.  It would certainly be
more transparent to the VM BIOS.  It would also make it much easier to
change the size if we found some devices need more/less/none space.  I
initially thought the fake BAR wasn't a great fix, but maybe its
advantages outweigh its downsides.  I'll give it a shot.  Thanks,

Alex
Kevin O'Connor Feb. 13, 2016, 6:18 p.m. UTC | #3
On Sat, Feb 13, 2016 at 08:12:09AM -0700, Alex Williamson wrote:
> On Fri, 12 Feb 2016 21:49:04 -0500
> "Kevin O'Connor" <kevin@koconnor.net> wrote:
> > On Fri, Feb 12, 2016 at 05:23:18PM -0700, Alex Williamson wrote:
> > > Intel IGD makes use of memory allocated and marked reserved by the
> > > BIOS as a stolen memory range.  For the most part, guest drivers don't
> > > make use of this, but our achilles heel is the vBIOS.  The vBIOS
> > > programs the device to use the host stolen memory range and it's used
> > > in the pre-boot environment.  Generally the guest won't have access to
> > > the host stolen memory area, so these accesses either land in VM
> > > memory or unassigned space and generate IOMMU faults.  By allocating
> > > this range in SeaBIOS and programming it into the device, QEMU (via
> > > vfio) can make sure this guest allocated stolen memory range is used
> > > instead.  
> > 
> > What does "vBIOS" mean in this context?  Is it the video device option
> > rom or something else?
> 
> vBIOS = video BIOS, you're correct, it's just the video device option
> ROM.

Is the problem from when the host runs the video option rom, or is the
problem from the guest (via SeaBIOS) running the video option rom?  If
the guest is running the option rom, is it the first time the option
rom has been run for the machine (ie, was the option rom not executed
on the host when the host machine first booted)?

FWIW, many of the chromebooks use coreboot with Intel graphics and a
number of users have installed SeaBIOS (running natively) on their
machines.  Running the intel video option rom more than once has been
known to cause issues.

[...]
> The write to 0x5C is used by QEMU code that traps writes to the device
> I/O port BAR and replaces the host stolen memory address with the new
> guest address generated here.  0x5C is initialized to 0x0 by kernel
> vfio code, so we can detect whether it has been written.  If not
> written, QEMU has no place to redirect to for stolen memory and it will
> either overlap VM memory or an unassigned area.  The former may corrupt
> VM memory, the latter throws host errors.  We could in QEMU halt with a
> hardware error if 0x5C hasn't been programmed.

So, if I understand correctly, 0x5C is not a "real" register on the
hardware, but is instead just a mechanism to give QEMU the address of
some guest visible ram?

BTW, is 0xFC a "real" register in the hardware?  How does the guest
find the location of the "OpRegion" if SeaBIOS allocates it?

-Kevin
Alex Williamson Feb. 13, 2016, 6:51 p.m. UTC | #4
On Sat, 13 Feb 2016 13:18:39 -0500
"Kevin O'Connor" <kevin@koconnor.net> wrote:

> On Sat, Feb 13, 2016 at 08:12:09AM -0700, Alex Williamson wrote:
> > On Fri, 12 Feb 2016 21:49:04 -0500
> > "Kevin O'Connor" <kevin@koconnor.net> wrote:  
> > > On Fri, Feb 12, 2016 at 05:23:18PM -0700, Alex Williamson wrote:  
> > > > Intel IGD makes use of memory allocated and marked reserved by the
> > > > BIOS as a stolen memory range.  For the most part, guest drivers don't
> > > > make use of this, but our achilles heel is the vBIOS.  The vBIOS
> > > > programs the device to use the host stolen memory range and it's used
> > > > in the pre-boot environment.  Generally the guest won't have access to
> > > > the host stolen memory area, so these accesses either land in VM
> > > > memory or unassigned space and generate IOMMU faults.  By allocating
> > > > this range in SeaBIOS and programming it into the device, QEMU (via
> > > > vfio) can make sure this guest allocated stolen memory range is used
> > > > instead.    
> > > 
> > > What does "vBIOS" mean in this context?  Is it the video device option
> > > rom or something else?  
> > 
> > vBIOS = video BIOS, you're correct, it's just the video device option
> > ROM.  
> 
> Is the problem from when the host runs the video option rom, or is the
> problem from the guest (via SeaBIOS) running the video option rom?  If
> the guest is running the option rom, is it the first time the option
> rom has been run for the machine (ie, was the option rom not executed
> on the host when the host machine first booted)?
> 
> FWIW, many of the chromebooks use coreboot with Intel graphics and a
> number of users have installed SeaBIOS (running natively) on their
> machines.  Running the intel video option rom more than once has been
> known to cause issues.

The issue is in the VM and it occurs every time the option ROM is
executed.  Standard VGA text mode displays fine (ex. SeaBIOS version
string and boot menu), but any sort of extended graphics mode (ex. live
CD boot menu) tries to make use of the host memory area which
corresponds to the stolen memory area of the physical device.  We're
not really sure how the ROM execution arrives at these addresses (not
from the device according to access traces), but we can see when the
ROM is writing these addresses to the device and modify they addresses
to point at a VM memory range which we've allocated.  That's what this
code attempts to do, allocate a buffer and tell QEMU about it via the
BDSM (Base Data of Stolen Memory) register.
 
> [...]
> > The write to 0x5C is used by QEMU code that traps writes to the
> > device I/O port BAR and replaces the host stolen memory address
> > with the new guest address generated here.  0x5C is initialized to
> > 0x0 by kernel vfio code, so we can detect whether it has been
> > written.  If not written, QEMU has no place to redirect to for
> > stolen memory and it will either overlap VM memory or an unassigned
> > area.  The former may corrupt VM memory, the latter throws host
> > errors.  We could in QEMU halt with a hardware error if 0x5C hasn't
> > been programmed.  
> 
> So, if I understand correctly, 0x5C is not a "real" register on the
> hardware, but is instead just a mechanism to give QEMU the address of
> some guest visible ram?

It is a real register, BDSM that is virtualized by vfio turning it
effectively into a scratch register.  On physical hardware the
register is read-only.
 
> BTW, is 0xFC a "real" register in the hardware?  How does the guest
> find the location of the "OpRegion" if SeaBIOS allocates it?

0xFC is the ASL Storage register, the guest finds the location of the
OpRegion using this register.  This is another register that is
read-only on real hardware but virtualized through vfio so we can
relocate the OpRegion into the VM address space.

I've found that allocating a dummy MMIO BAR does work as an alternative
for mapping space for this stolen memory into the VM address space.
For a Linux guest it works to allocate BAR5 on the IGD device.
Windows10 is not so happy with this, but does work if I allocate the
BAR on something like the ISA bridge device.  My guess is that the IGD
driver in Windows freaks out at finding this strange new BAR on its
device.  So I'll need to come up with an algorithm for either creating
a dummy PCI device to host this BAR or trying to add it to other
existing devices.  It's certainly a more self-contained solution this
way, so I expect we'll only need patch 1/3 from this series.  Thanks,

Alex
Kevin O'Connor Feb. 13, 2016, 8:05 p.m. UTC | #5
On Sat, Feb 13, 2016 at 11:51:51AM -0700, Alex Williamson wrote:
> On Sat, 13 Feb 2016 13:18:39 -0500
> "Kevin O'Connor" <kevin@koconnor.net> wrote:
> > On Sat, Feb 13, 2016 at 08:12:09AM -0700, Alex Williamson wrote:
> > > On Fri, 12 Feb 2016 21:49:04 -0500
> > > "Kevin O'Connor" <kevin@koconnor.net> wrote:  
> > > > On Fri, Feb 12, 2016 at 05:23:18PM -0700, Alex Williamson wrote:  
> > > > > Intel IGD makes use of memory allocated and marked reserved by the
> > > > > BIOS as a stolen memory range.  For the most part, guest drivers don't
> > > > > make use of this, but our achilles heel is the vBIOS.  The vBIOS
> > > > > programs the device to use the host stolen memory range and it's used
> > > > > in the pre-boot environment.  Generally the guest won't have access to
> > > > > the host stolen memory area, so these accesses either land in VM
> > > > > memory or unassigned space and generate IOMMU faults.  By allocating
> > > > > this range in SeaBIOS and programming it into the device, QEMU (via
> > > > > vfio) can make sure this guest allocated stolen memory range is used
> > > > > instead.    
> > > > 
> > > > What does "vBIOS" mean in this context?  Is it the video device option
> > > > rom or something else?  
> > > 
> > > vBIOS = video BIOS, you're correct, it's just the video device option
> > > ROM.  
> > 
> > Is the problem from when the host runs the video option rom, or is the
> > problem from the guest (via SeaBIOS) running the video option rom?  If
> > the guest is running the option rom, is it the first time the option
> > rom has been run for the machine (ie, was the option rom not executed
> > on the host when the host machine first booted)?
> > 
> > FWIW, many of the chromebooks use coreboot with Intel graphics and a
> > number of users have installed SeaBIOS (running natively) on their
> > machines.  Running the intel video option rom more than once has been
> > known to cause issues.
> 
> The issue is in the VM and it occurs every time the option ROM is
> executed.  Standard VGA text mode displays fine (ex. SeaBIOS version
> string and boot menu), but any sort of extended graphics mode (ex. live
> CD boot menu) tries to make use of the host memory area which
> corresponds to the stolen memory area of the physical device.  We're
> not really sure how the ROM execution arrives at these addresses (not
> from the device according to access traces), but we can see when the
> ROM is writing these addresses to the device and modify they addresses
> to point at a VM memory range which we've allocated.  That's what this
> code attempts to do, allocate a buffer and tell QEMU about it via the
> BDSM (Base Data of Stolen Memory) register.

Forgive me if I'm not fully understanding this.  If I read what you're
saying then the sequence is something like:

1 - the host system bios (or vgabios) programs the GTT/stolen memory
    base register at host system bootup time and reserves it in the
    host e820 map.

2 - upon running qemu, the guest reruns the vga bios option rom which
    seems to work (ie, text mode works)

3 - in the guest, upon running a bootloader that uses graphics mode,
    the bootloader calls the vgabios to switch to graphics mode, and
    the vgabios sends commands to the graphics hardware that somehow
    reference the host stolen memory

4 - your patch causes QEMU to catch these commands with references to
    the host stolen memory and replace them with references to the
    guest stolen memory (which seabios creates)

Am I understanding the above correctly?

Is the only reason to run the intel option rom in the guest for
bootloader graphic mode support?  Last time I looked, the intel vga
hardware could fully emulate a legacy vga device - if the device is in
vga compatibility mode then it may be possible to have seavgabios
drive mode changes.

> > [...]
> > > The write to 0x5C is used by QEMU code that traps writes to the
> > > device I/O port BAR and replaces the host stolen memory address
> > > with the new guest address generated here.  0x5C is initialized to
> > > 0x0 by kernel vfio code, so we can detect whether it has been
> > > written.  If not written, QEMU has no place to redirect to for
> > > stolen memory and it will either overlap VM memory or an unassigned
> > > area.  The former may corrupt VM memory, the latter throws host
> > > errors.  We could in QEMU halt with a hardware error if 0x5C hasn't
> > > been programmed.  
> > 
> > So, if I understand correctly, 0x5C is not a "real" register on the
> > hardware, but is instead just a mechanism to give QEMU the address of
> > some guest visible ram?
> 
> It is a real register, BDSM that is virtualized by vfio turning it
> effectively into a scratch register.  On physical hardware the
> register is read-only.
>  
> > BTW, is 0xFC a "real" register in the hardware?  How does the guest
> > find the location of the "OpRegion" if SeaBIOS allocates it?
> 
> 0xFC is the ASL Storage register, the guest finds the location of the
> OpRegion using this register.  This is another register that is
> read-only on real hardware but virtualized through vfio so we can
> relocate the OpRegion into the VM address space.
> 
> I've found that allocating a dummy MMIO BAR does work as an alternative
> for mapping space for this stolen memory into the VM address space.
> For a Linux guest it works to allocate BAR5 on the IGD device.
> Windows10 is not so happy with this, but does work if I allocate the
> BAR on something like the ISA bridge device.  My guess is that the IGD
> driver in Windows freaks out at finding this strange new BAR on its
> device.  So I'll need to come up with an algorithm for either creating
> a dummy PCI device to host this BAR or trying to add it to other
> existing devices.  It's certainly a more self-contained solution this
> way, so I expect we'll only need patch 1/3 from this series.  Thanks,

Okay.  (I'm not saying patch 3 is bad, but okay.)

If you go through the trouble of mapping the BDSM through a pci bar,
then why not do the same with ASLS then too?

-Kevin
Alex Williamson Feb. 13, 2016, 8:57 p.m. UTC | #6
On Sat, 13 Feb 2016 15:05:09 -0500
"Kevin O'Connor" <kevin@koconnor.net> wrote:

> On Sat, Feb 13, 2016 at 11:51:51AM -0700, Alex Williamson wrote:
> > On Sat, 13 Feb 2016 13:18:39 -0500
> > "Kevin O'Connor" <kevin@koconnor.net> wrote:  
> > > On Sat, Feb 13, 2016 at 08:12:09AM -0700, Alex Williamson wrote:  
> > > > On Fri, 12 Feb 2016 21:49:04 -0500
> > > > "Kevin O'Connor" <kevin@koconnor.net> wrote:    
> > > > > On Fri, Feb 12, 2016 at 05:23:18PM -0700, Alex Williamson wrote:    
> > > > > > Intel IGD makes use of memory allocated and marked reserved by the
> > > > > > BIOS as a stolen memory range.  For the most part, guest drivers don't
> > > > > > make use of this, but our achilles heel is the vBIOS.  The vBIOS
> > > > > > programs the device to use the host stolen memory range and it's used
> > > > > > in the pre-boot environment.  Generally the guest won't have access to
> > > > > > the host stolen memory area, so these accesses either land in VM
> > > > > > memory or unassigned space and generate IOMMU faults.  By allocating
> > > > > > this range in SeaBIOS and programming it into the device, QEMU (via
> > > > > > vfio) can make sure this guest allocated stolen memory range is used
> > > > > > instead.      
> > > > > 
> > > > > What does "vBIOS" mean in this context?  Is it the video device option
> > > > > rom or something else?    
> > > > 
> > > > vBIOS = video BIOS, you're correct, it's just the video device option
> > > > ROM.    
> > > 
> > > Is the problem from when the host runs the video option rom, or is the
> > > problem from the guest (via SeaBIOS) running the video option rom?  If
> > > the guest is running the option rom, is it the first time the option
> > > rom has been run for the machine (ie, was the option rom not executed
> > > on the host when the host machine first booted)?
> > > 
> > > FWIW, many of the chromebooks use coreboot with Intel graphics and a
> > > number of users have installed SeaBIOS (running natively) on their
> > > machines.  Running the intel video option rom more than once has been
> > > known to cause issues.  
> > 
> > The issue is in the VM and it occurs every time the option ROM is
> > executed.  Standard VGA text mode displays fine (ex. SeaBIOS version
> > string and boot menu), but any sort of extended graphics mode (ex. live
> > CD boot menu) tries to make use of the host memory area which
> > corresponds to the stolen memory area of the physical device.  We're
> > not really sure how the ROM execution arrives at these addresses (not
> > from the device according to access traces), but we can see when the
> > ROM is writing these addresses to the device and modify they addresses
> > to point at a VM memory range which we've allocated.  That's what this
> > code attempts to do, allocate a buffer and tell QEMU about it via the
> > BDSM (Base Data of Stolen Memory) register.  
> 
> Forgive me if I'm not fully understanding this.  If I read what you're
> saying then the sequence is something like:
> 
> 1 - the host system bios (or vgabios) programs the GTT/stolen memory
>     base register at host system bootup time and reserves it in the
>     host e820 map.
> 
> 2 - upon running qemu, the guest reruns the vga bios option rom which
>     seems to work (ie, text mode works)
> 
> 3 - in the guest, upon running a bootloader that uses graphics mode,
>     the bootloader calls the vgabios to switch to graphics mode, and
>     the vgabios sends commands to the graphics hardware that somehow
>     reference the host stolen memory

What exactly happens here isn't clear to me, but this is a plausible
explanation.  What we see in tracing access to the hardware is that a
bunch of addresses are written to the device that fall within the host
e820 reserved area and then the device starts generating IOMMU faults
trying to access those addresses.

> 4 - your patch causes QEMU to catch these commands with references to
>     the host stolen memory and replace them with references to the
>     guest stolen memory (which seabios creates)
> 
> Am I understanding the above correctly?

Yes.
 
> Is the only reason to run the intel option rom in the guest for
> bootloader graphic mode support?  Last time I looked, the intel vga
> hardware could fully emulate a legacy vga device - if the device is in
> vga compatibility mode then it may be possible to have seavgabios
> drive mode changes.

I have a SandyBridge based laptop (Lenovo W520) where the LCD panel
won't turn on without the vBIOS.  Another desktop IvyBridge system
doesn't really care about the vBIOS so long as we don't ask it to
output anything before the guest native drivers are loaded.  If we
could, I think we'd just enable vBIOS for laptop panel support, but
that's really not an option, it's going to run as a boot option ROM as
well, so we need to fix the issues that it generates there.

> > > [...]  
> > > > The write to 0x5C is used by QEMU code that traps writes to the
> > > > device I/O port BAR and replaces the host stolen memory address
> > > > with the new guest address generated here.  0x5C is initialized to
> > > > 0x0 by kernel vfio code, so we can detect whether it has been
> > > > written.  If not written, QEMU has no place to redirect to for
> > > > stolen memory and it will either overlap VM memory or an unassigned
> > > > area.  The former may corrupt VM memory, the latter throws host
> > > > errors.  We could in QEMU halt with a hardware error if 0x5C hasn't
> > > > been programmed.    
> > > 
> > > So, if I understand correctly, 0x5C is not a "real" register on the
> > > hardware, but is instead just a mechanism to give QEMU the address of
> > > some guest visible ram?  
> > 
> > It is a real register, BDSM that is virtualized by vfio turning it
> > effectively into a scratch register.  On physical hardware the
> > register is read-only.
> >    
> > > BTW, is 0xFC a "real" register in the hardware?  How does the guest
> > > find the location of the "OpRegion" if SeaBIOS allocates it?  
> > 
> > 0xFC is the ASL Storage register, the guest finds the location of the
> > OpRegion using this register.  This is another register that is
> > read-only on real hardware but virtualized through vfio so we can
> > relocate the OpRegion into the VM address space.
> > 
> > I've found that allocating a dummy MMIO BAR does work as an alternative
> > for mapping space for this stolen memory into the VM address space.
> > For a Linux guest it works to allocate BAR5 on the IGD device.
> > Windows10 is not so happy with this, but does work if I allocate the
> > BAR on something like the ISA bridge device.  My guess is that the IGD
> > driver in Windows freaks out at finding this strange new BAR on its
> > device.  So I'll need to come up with an algorithm for either creating
> > a dummy PCI device to host this BAR or trying to add it to other
> > existing devices.  It's certainly a more self-contained solution this
> > way, so I expect we'll only need patch 1/3 from this series.  Thanks,  
> 
> Okay.  (I'm not saying patch 3 is bad, but okay.)
> 
> If you go through the trouble of mapping the BDSM through a pci bar,
> then why not do the same with ASLS then too?

I suppose we could do that.  There are a few nuances to the fake BAR
solution:

1) The BAR needs to get mapped and not remapped while in use - usually
   not a problem.

2) The guest needs to not disable the device we attach the BARs to,
   which it might do if it doesn't recognize the device.

3) We need to be careful about adding BARs to devices the guest does
   have drivers for or we might overlap real functionality.

4) If we create a dummy device with bogus IDs, it will show up with an
   exclamation mark in device manager, which makes people unhappy.

So from a perspective of being self contained, the fake BAR solution is
very good, but it's not without issue.  I'll try to think of what sort
of dummy device we could create that would always have a guest driver,
but nothing that a couple extra BARs would interfere with.  Maybe a
generic PCI bridge.  Thanks,

Alex
Kevin O'Connor Feb. 14, 2016, 12:20 a.m. UTC | #7
On Sat, Feb 13, 2016 at 01:57:09PM -0700, Alex Williamson wrote:
> On Sat, 13 Feb 2016 15:05:09 -0500
> "Kevin O'Connor" <kevin@koconnor.net> wrote:
> > On Sat, Feb 13, 2016 at 11:51:51AM -0700, Alex Williamson wrote:
> > > On Sat, 13 Feb 2016 13:18:39 -0500
> > > "Kevin O'Connor" <kevin@koconnor.net> wrote:  
> > > > On Sat, Feb 13, 2016 at 08:12:09AM -0700, Alex Williamson wrote:  
> > > > > On Fri, 12 Feb 2016 21:49:04 -0500
> > > > > "Kevin O'Connor" <kevin@koconnor.net> wrote:    
> > > > > > On Fri, Feb 12, 2016 at 05:23:18PM -0700, Alex Williamson wrote:    
> > > > > > > Intel IGD makes use of memory allocated and marked reserved by the
> > > > > > > BIOS as a stolen memory range.  For the most part, guest drivers don't
> > > > > > > make use of this, but our achilles heel is the vBIOS.  The vBIOS
> > > > > > > programs the device to use the host stolen memory range and it's used
> > > > > > > in the pre-boot environment.  Generally the guest won't have access to
> > > > > > > the host stolen memory area, so these accesses either land in VM
> > > > > > > memory or unassigned space and generate IOMMU faults.  By allocating
> > > > > > > this range in SeaBIOS and programming it into the device, QEMU (via
> > > > > > > vfio) can make sure this guest allocated stolen memory range is used
> > > > > > > instead.      
> > > > > > 
> > > > > > What does "vBIOS" mean in this context?  Is it the video device option
> > > > > > rom or something else?    
> > > > > 
> > > > > vBIOS = video BIOS, you're correct, it's just the video device option
> > > > > ROM.    
> > > > 
> > > > Is the problem from when the host runs the video option rom, or is the
> > > > problem from the guest (via SeaBIOS) running the video option rom?  If
> > > > the guest is running the option rom, is it the first time the option
> > > > rom has been run for the machine (ie, was the option rom not executed
> > > > on the host when the host machine first booted)?
> > > > 
> > > > FWIW, many of the chromebooks use coreboot with Intel graphics and a
> > > > number of users have installed SeaBIOS (running natively) on their
> > > > machines.  Running the intel video option rom more than once has been
> > > > known to cause issues.  
> > > 
> > > The issue is in the VM and it occurs every time the option ROM is
> > > executed.  Standard VGA text mode displays fine (ex. SeaBIOS version
> > > string and boot menu), but any sort of extended graphics mode (ex. live
> > > CD boot menu) tries to make use of the host memory area which
> > > corresponds to the stolen memory area of the physical device.  We're
> > > not really sure how the ROM execution arrives at these addresses (not
> > > from the device according to access traces), but we can see when the
> > > ROM is writing these addresses to the device and modify they addresses
> > > to point at a VM memory range which we've allocated.  That's what this
> > > code attempts to do, allocate a buffer and tell QEMU about it via the
> > > BDSM (Base Data of Stolen Memory) register.  
> > 
> > Forgive me if I'm not fully understanding this.  If I read what you're
> > saying then the sequence is something like:
> > 
> > 1 - the host system bios (or vgabios) programs the GTT/stolen memory
> >     base register at host system bootup time and reserves it in the
> >     host e820 map.
> > 
> > 2 - upon running qemu, the guest reruns the vga bios option rom which
> >     seems to work (ie, text mode works)
> > 
> > 3 - in the guest, upon running a bootloader that uses graphics mode,
> >     the bootloader calls the vgabios to switch to graphics mode, and
> >     the vgabios sends commands to the graphics hardware that somehow
> >     reference the host stolen memory
> 
> What exactly happens here isn't clear to me, but this is a plausible
> explanation.  What we see in tracing access to the hardware is that a
> bunch of addresses are written to the device that fall within the host
> e820 reserved area and then the device starts generating IOMMU faults
> trying to access those addresses.
> 
> > 4 - your patch causes QEMU to catch these commands with references to
> >     the host stolen memory and replace them with references to the
> >     guest stolen memory (which seabios creates)
> > 
> > Am I understanding the above correctly?
> 
> Yes.
>  
> > Is the only reason to run the intel option rom in the guest for
> > bootloader graphic mode support?  Last time I looked, the intel vga
> > hardware could fully emulate a legacy vga device - if the device is in
> > vga compatibility mode then it may be possible to have seavgabios
> > drive mode changes.
> 
> I have a SandyBridge based laptop (Lenovo W520) where the LCD panel
> won't turn on without the vBIOS.

This confuses me - why didn't the host system BIOS turn on the LCD
panel during host bootup?

>Another desktop IvyBridge system
> doesn't really care about the vBIOS so long as we don't ask it to
> output anything before the guest native drivers are loaded.  If we
> could, I think we'd just enable vBIOS for laptop panel support, but
> that's really not an option, it's going to run as a boot option ROM as
> well, so we need to fix the issues that it generates there.

From my experience with coreboot, running the vga option rom multiple
times during a given boot is very fragile.  (By multiple times, I mean
either the host running it and then a guest, or running it multiple
times from multiple guests.)  YMMV.

> > > > [...]  
> > > > > The write to 0x5C is used by QEMU code that traps writes to the
> > > > > device I/O port BAR and replaces the host stolen memory address
> > > > > with the new guest address generated here.  0x5C is initialized to
> > > > > 0x0 by kernel vfio code, so we can detect whether it has been
> > > > > written.  If not written, QEMU has no place to redirect to for
> > > > > stolen memory and it will either overlap VM memory or an unassigned
> > > > > area.  The former may corrupt VM memory, the latter throws host
> > > > > errors.  We could in QEMU halt with a hardware error if 0x5C hasn't
> > > > > been programmed.    
> > > > 
> > > > So, if I understand correctly, 0x5C is not a "real" register on the
> > > > hardware, but is instead just a mechanism to give QEMU the address of
> > > > some guest visible ram?  
> > > 
> > > It is a real register, BDSM that is virtualized by vfio turning it
> > > effectively into a scratch register.  On physical hardware the
> > > register is read-only.
> > >    
> > > > BTW, is 0xFC a "real" register in the hardware?  How does the guest
> > > > find the location of the "OpRegion" if SeaBIOS allocates it?  
> > > 
> > > 0xFC is the ASL Storage register, the guest finds the location of the
> > > OpRegion using this register.  This is another register that is
> > > read-only on real hardware but virtualized through vfio so we can
> > > relocate the OpRegion into the VM address space.
> > > 
> > > I've found that allocating a dummy MMIO BAR does work as an alternative
> > > for mapping space for this stolen memory into the VM address space.
> > > For a Linux guest it works to allocate BAR5 on the IGD device.
> > > Windows10 is not so happy with this, but does work if I allocate the
> > > BAR on something like the ISA bridge device.  My guess is that the IGD
> > > driver in Windows freaks out at finding this strange new BAR on its
> > > device.  So I'll need to come up with an algorithm for either creating
> > > a dummy PCI device to host this BAR or trying to add it to other
> > > existing devices.  It's certainly a more self-contained solution this
> > > way, so I expect we'll only need patch 1/3 from this series.  Thanks,  
> > 
> > Okay.  (I'm not saying patch 3 is bad, but okay.)
> > 
> > If you go through the trouble of mapping the BDSM through a pci bar,
> > then why not do the same with ASLS then too?
> 
> I suppose we could do that.  There are a few nuances to the fake BAR
> solution:
> 
> 1) The BAR needs to get mapped and not remapped while in use - usually
>    not a problem.
> 
> 2) The guest needs to not disable the device we attach the BARs to,
>    which it might do if it doesn't recognize the device.
> 
> 3) We need to be careful about adding BARs to devices the guest does
>    have drivers for or we might overlap real functionality.
> 
> 4) If we create a dummy device with bogus IDs, it will show up with an
>    exclamation mark in device manager, which makes people unhappy.
> 
> So from a perspective of being self contained, the fake BAR solution is
> very good, but it's not without issue.  I'll try to think of what sort
> of dummy device we could create that would always have a guest driver,
> but nothing that a couple extra BARs would interfere with.  Maybe a
> generic PCI bridge.  Thanks,

Okay.  Again, I'm not stating a preferred direction.

BTW, I wonder if the recent discussion between Michael and Igor is
relevant here:
  https://lists.gnu.org/archive/html/qemu-devel/2016-01/msg05602.html

-Kevin
Alex Williamson Feb. 14, 2016, 1:03 a.m. UTC | #8
On Sat, 13 Feb 2016 19:20:32 -0500
"Kevin O'Connor" <kevin@koconnor.net> wrote:

> On Sat, Feb 13, 2016 at 01:57:09PM -0700, Alex Williamson wrote:
> > On Sat, 13 Feb 2016 15:05:09 -0500
> > "Kevin O'Connor" <kevin@koconnor.net> wrote:  
> > > On Sat, Feb 13, 2016 at 11:51:51AM -0700, Alex Williamson wrote:  
> > > > On Sat, 13 Feb 2016 13:18:39 -0500
> > > > "Kevin O'Connor" <kevin@koconnor.net> wrote:    
> > > > > On Sat, Feb 13, 2016 at 08:12:09AM -0700, Alex Williamson wrote:    
> > > > > > On Fri, 12 Feb 2016 21:49:04 -0500
> > > > > > "Kevin O'Connor" <kevin@koconnor.net> wrote:      
> > > > > > > On Fri, Feb 12, 2016 at 05:23:18PM -0700, Alex Williamson wrote:      
> > > > > > > > Intel IGD makes use of memory allocated and marked reserved by the
> > > > > > > > BIOS as a stolen memory range.  For the most part, guest drivers don't
> > > > > > > > make use of this, but our achilles heel is the vBIOS.  The vBIOS
> > > > > > > > programs the device to use the host stolen memory range and it's used
> > > > > > > > in the pre-boot environment.  Generally the guest won't have access to
> > > > > > > > the host stolen memory area, so these accesses either land in VM
> > > > > > > > memory or unassigned space and generate IOMMU faults.  By allocating
> > > > > > > > this range in SeaBIOS and programming it into the device, QEMU (via
> > > > > > > > vfio) can make sure this guest allocated stolen memory range is used
> > > > > > > > instead.        
> > > > > > > 
> > > > > > > What does "vBIOS" mean in this context?  Is it the video device option
> > > > > > > rom or something else?      
> > > > > > 
> > > > > > vBIOS = video BIOS, you're correct, it's just the video device option
> > > > > > ROM.      
> > > > > 
> > > > > Is the problem from when the host runs the video option rom, or is the
> > > > > problem from the guest (via SeaBIOS) running the video option rom?  If
> > > > > the guest is running the option rom, is it the first time the option
> > > > > rom has been run for the machine (ie, was the option rom not executed
> > > > > on the host when the host machine first booted)?
> > > > > 
> > > > > FWIW, many of the chromebooks use coreboot with Intel graphics and a
> > > > > number of users have installed SeaBIOS (running natively) on their
> > > > > machines.  Running the intel video option rom more than once has been
> > > > > known to cause issues.    
> > > > 
> > > > The issue is in the VM and it occurs every time the option ROM is
> > > > executed.  Standard VGA text mode displays fine (ex. SeaBIOS version
> > > > string and boot menu), but any sort of extended graphics mode (ex. live
> > > > CD boot menu) tries to make use of the host memory area which
> > > > corresponds to the stolen memory area of the physical device.  We're
> > > > not really sure how the ROM execution arrives at these addresses (not
> > > > from the device according to access traces), but we can see when the
> > > > ROM is writing these addresses to the device and modify they addresses
> > > > to point at a VM memory range which we've allocated.  That's what this
> > > > code attempts to do, allocate a buffer and tell QEMU about it via the
> > > > BDSM (Base Data of Stolen Memory) register.    
> > > 
> > > Forgive me if I'm not fully understanding this.  If I read what you're
> > > saying then the sequence is something like:
> > > 
> > > 1 - the host system bios (or vgabios) programs the GTT/stolen memory
> > >     base register at host system bootup time and reserves it in the
> > >     host e820 map.
> > > 
> > > 2 - upon running qemu, the guest reruns the vga bios option rom which
> > >     seems to work (ie, text mode works)
> > > 
> > > 3 - in the guest, upon running a bootloader that uses graphics mode,
> > >     the bootloader calls the vgabios to switch to graphics mode, and
> > >     the vgabios sends commands to the graphics hardware that somehow
> > >     reference the host stolen memory  
> > 
> > What exactly happens here isn't clear to me, but this is a plausible
> > explanation.  What we see in tracing access to the hardware is that a
> > bunch of addresses are written to the device that fall within the host
> > e820 reserved area and then the device starts generating IOMMU faults
> > trying to access those addresses.
> >   
> > > 4 - your patch causes QEMU to catch these commands with references to
> > >     the host stolen memory and replace them with references to the
> > >     guest stolen memory (which seabios creates)
> > > 
> > > Am I understanding the above correctly?  
> > 
> > Yes.
> >    
> > > Is the only reason to run the intel option rom in the guest for
> > > bootloader graphic mode support?  Last time I looked, the intel vga
> > > hardware could fully emulate a legacy vga device - if the device is in
> > > vga compatibility mode then it may be possible to have seavgabios
> > > drive mode changes.  
> > 
> > I have a SandyBridge based laptop (Lenovo W520) where the LCD panel
> > won't turn on without the vBIOS.  
> 
> This confuses me - why didn't the host system BIOS turn on the LCD
> panel during host bootup?

It turns off when we reset the device between VM instances or between
VM boots.  IGD supports Function Level Reset (FLR).

> >Another desktop IvyBridge system
> > doesn't really care about the vBIOS so long as we don't ask it to
> > output anything before the guest native drivers are loaded.  If we
> > could, I think we'd just enable vBIOS for laptop panel support, but
> > that's really not an option, it's going to run as a boot option ROM as
> > well, so we need to fix the issues that it generates there.  
> 
> From my experience with coreboot, running the vga option rom multiple
> times during a given boot is very fragile.  (By multiple times, I mean
> either the host running it and then a guest, or running it multiple
> times from multiple guests.)  YMMV.

We do this regularly for graphics assignment, Nvidia, AMD, and now
Intel.  It generally works ok.  Perhaps you've seen issues with the
option ROM being run multiple times without resetting the device.  I
could certainly believe that.  We only have one blacklisted Broadcom
ROM in vfio, probably due to missing or incomplete device reset method.
 
> > > > > [...]    
> > > > > > The write to 0x5C is used by QEMU code that traps writes to the
> > > > > > device I/O port BAR and replaces the host stolen memory address
> > > > > > with the new guest address generated here.  0x5C is initialized to
> > > > > > 0x0 by kernel vfio code, so we can detect whether it has been
> > > > > > written.  If not written, QEMU has no place to redirect to for
> > > > > > stolen memory and it will either overlap VM memory or an unassigned
> > > > > > area.  The former may corrupt VM memory, the latter throws host
> > > > > > errors.  We could in QEMU halt with a hardware error if 0x5C hasn't
> > > > > > been programmed.      
> > > > > 
> > > > > So, if I understand correctly, 0x5C is not a "real" register on the
> > > > > hardware, but is instead just a mechanism to give QEMU the address of
> > > > > some guest visible ram?    
> > > > 
> > > > It is a real register, BDSM that is virtualized by vfio turning it
> > > > effectively into a scratch register.  On physical hardware the
> > > > register is read-only.
> > > >      
> > > > > BTW, is 0xFC a "real" register in the hardware?  How does the guest
> > > > > find the location of the "OpRegion" if SeaBIOS allocates it?    
> > > > 
> > > > 0xFC is the ASL Storage register, the guest finds the location of the
> > > > OpRegion using this register.  This is another register that is
> > > > read-only on real hardware but virtualized through vfio so we can
> > > > relocate the OpRegion into the VM address space.
> > > > 
> > > > I've found that allocating a dummy MMIO BAR does work as an alternative
> > > > for mapping space for this stolen memory into the VM address space.
> > > > For a Linux guest it works to allocate BAR5 on the IGD device.
> > > > Windows10 is not so happy with this, but does work if I allocate the
> > > > BAR on something like the ISA bridge device.  My guess is that the IGD
> > > > driver in Windows freaks out at finding this strange new BAR on its
> > > > device.  So I'll need to come up with an algorithm for either creating
> > > > a dummy PCI device to host this BAR or trying to add it to other
> > > > existing devices.  It's certainly a more self-contained solution this
> > > > way, so I expect we'll only need patch 1/3 from this series.  Thanks,    
> > > 
> > > Okay.  (I'm not saying patch 3 is bad, but okay.)
> > > 
> > > If you go through the trouble of mapping the BDSM through a pci bar,
> > > then why not do the same with ASLS then too?  
> > 
> > I suppose we could do that.  There are a few nuances to the fake BAR
> > solution:
> > 
> > 1) The BAR needs to get mapped and not remapped while in use - usually
> >    not a problem.
> > 
> > 2) The guest needs to not disable the device we attach the BARs to,
> >    which it might do if it doesn't recognize the device.
> > 
> > 3) We need to be careful about adding BARs to devices the guest does
> >    have drivers for or we might overlap real functionality.
> > 
> > 4) If we create a dummy device with bogus IDs, it will show up with an
> >    exclamation mark in device manager, which makes people unhappy.
> > 
> > So from a perspective of being self contained, the fake BAR solution is
> > very good, but it's not without issue.  I'll try to think of what sort
> > of dummy device we could create that would always have a guest driver,
> > but nothing that a couple extra BARs would interfere with.  Maybe a
> > generic PCI bridge.  Thanks,  
> 
> Okay.  Again, I'm not stating a preferred direction.
> 
> BTW, I wonder if the recent discussion between Michael and Igor is
> relevant here:
>   https://lists.gnu.org/archive/html/qemu-devel/2016-01/msg05602.html

I'm certainly open to rebuttals against this approach, but I do have it
working.  Being entirely self contained is pretty intriguing.
Theoretically this would allow us to work with OVMF with no
modifications.  Linux guests enable and disable devices several times
during boot (per the spec, any time the BAR is sized it should be
disabled), Windows never seems to disable the device.  The LPC/ISA
bridge seems to be the best place to put these BARs, we need to create
that anyway for pre-Broadwell/Skylake and the device itself has no
implemented BARs. The ISA bridge is just a shell device to keep the
driver happy on those older chips, so squatting on a couple BARs
doesn't seem too terrible. Thanks,

Alex
Igor Mammedov Feb. 15, 2016, 9:54 a.m. UTC | #9
On Sat, 13 Feb 2016 18:03:31 -0700
Alex Williamson <alex.williamson@redhat.com> wrote:

> On Sat, 13 Feb 2016 19:20:32 -0500
> "Kevin O'Connor" <kevin@koconnor.net> wrote:
> 
> > On Sat, Feb 13, 2016 at 01:57:09PM -0700, Alex Williamson wrote:  
> > > On Sat, 13 Feb 2016 15:05:09 -0500
> > > "Kevin O'Connor" <kevin@koconnor.net> wrote:    
> > > > On Sat, Feb 13, 2016 at 11:51:51AM -0700, Alex Williamson wrote:    
> > > > > On Sat, 13 Feb 2016 13:18:39 -0500
> > > > > "Kevin O'Connor" <kevin@koconnor.net> wrote:      
> > > > > > On Sat, Feb 13, 2016 at 08:12:09AM -0700, Alex Williamson wrote:      
> > > > > > > On Fri, 12 Feb 2016 21:49:04 -0500
> > > > > > > "Kevin O'Connor" <kevin@koconnor.net> wrote:        
> > > > > > > > On Fri, Feb 12, 2016 at 05:23:18PM -0700, Alex Williamson wrote:        
> > > > > > > > > Intel IGD makes use of memory allocated and marked reserved by the
> > > > > > > > > BIOS as a stolen memory range.  For the most part, guest drivers don't
> > > > > > > > > make use of this, but our achilles heel is the vBIOS.  The vBIOS
> > > > > > > > > programs the device to use the host stolen memory range and it's used
> > > > > > > > > in the pre-boot environment.  Generally the guest won't have access to
> > > > > > > > > the host stolen memory area, so these accesses either land in VM
> > > > > > > > > memory or unassigned space and generate IOMMU faults.  By allocating
> > > > > > > > > this range in SeaBIOS and programming it into the device, QEMU (via
> > > > > > > > > vfio) can make sure this guest allocated stolen memory range is used
> > > > > > > > > instead.          
> > > > > > > > 
> > > > > > > > What does "vBIOS" mean in this context?  Is it the video device option
> > > > > > > > rom or something else?        
> > > > > > > 
> > > > > > > vBIOS = video BIOS, you're correct, it's just the video device option
> > > > > > > ROM.        
> > > > > > 
> > > > > > Is the problem from when the host runs the video option rom, or is the
> > > > > > problem from the guest (via SeaBIOS) running the video option rom?  If
> > > > > > the guest is running the option rom, is it the first time the option
> > > > > > rom has been run for the machine (ie, was the option rom not executed
> > > > > > on the host when the host machine first booted)?
> > > > > > 
> > > > > > FWIW, many of the chromebooks use coreboot with Intel graphics and a
> > > > > > number of users have installed SeaBIOS (running natively) on their
> > > > > > machines.  Running the intel video option rom more than once has been
> > > > > > known to cause issues.      
> > > > > 
> > > > > The issue is in the VM and it occurs every time the option ROM is
> > > > > executed.  Standard VGA text mode displays fine (ex. SeaBIOS version
> > > > > string and boot menu), but any sort of extended graphics mode (ex. live
> > > > > CD boot menu) tries to make use of the host memory area which
> > > > > corresponds to the stolen memory area of the physical device.  We're
> > > > > not really sure how the ROM execution arrives at these addresses (not
> > > > > from the device according to access traces), but we can see when the
> > > > > ROM is writing these addresses to the device and modify they addresses
> > > > > to point at a VM memory range which we've allocated.  That's what this
> > > > > code attempts to do, allocate a buffer and tell QEMU about it via the
> > > > > BDSM (Base Data of Stolen Memory) register.      
> > > > 
> > > > Forgive me if I'm not fully understanding this.  If I read what you're
> > > > saying then the sequence is something like:
> > > > 
> > > > 1 - the host system bios (or vgabios) programs the GTT/stolen memory
> > > >     base register at host system bootup time and reserves it in the
> > > >     host e820 map.
> > > > 
> > > > 2 - upon running qemu, the guest reruns the vga bios option rom which
> > > >     seems to work (ie, text mode works)
> > > > 
> > > > 3 - in the guest, upon running a bootloader that uses graphics mode,
> > > >     the bootloader calls the vgabios to switch to graphics mode, and
> > > >     the vgabios sends commands to the graphics hardware that somehow
> > > >     reference the host stolen memory    
> > > 
> > > What exactly happens here isn't clear to me, but this is a plausible
> > > explanation.  What we see in tracing access to the hardware is that a
> > > bunch of addresses are written to the device that fall within the host
> > > e820 reserved area and then the device starts generating IOMMU faults
> > > trying to access those addresses.
> > >     
> > > > 4 - your patch causes QEMU to catch these commands with references to
> > > >     the host stolen memory and replace them with references to the
> > > >     guest stolen memory (which seabios creates)
> > > > 
> > > > Am I understanding the above correctly?    
> > > 
> > > Yes.
> > >      
> > > > Is the only reason to run the intel option rom in the guest for
> > > > bootloader graphic mode support?  Last time I looked, the intel vga
> > > > hardware could fully emulate a legacy vga device - if the device is in
> > > > vga compatibility mode then it may be possible to have seavgabios
> > > > drive mode changes.    
> > > 
> > > I have a SandyBridge based laptop (Lenovo W520) where the LCD panel
> > > won't turn on without the vBIOS.    
> > 
> > This confuses me - why didn't the host system BIOS turn on the LCD
> > panel during host bootup?  
> 
> It turns off when we reset the device between VM instances or between
> VM boots.  IGD supports Function Level Reset (FLR).
> 
> > >Another desktop IvyBridge system
> > > doesn't really care about the vBIOS so long as we don't ask it to
> > > output anything before the guest native drivers are loaded.  If we
> > > could, I think we'd just enable vBIOS for laptop panel support, but
> > > that's really not an option, it's going to run as a boot option ROM as
> > > well, so we need to fix the issues that it generates there.    
> > 
> > From my experience with coreboot, running the vga option rom multiple
> > times during a given boot is very fragile.  (By multiple times, I mean
> > either the host running it and then a guest, or running it multiple
> > times from multiple guests.)  YMMV.  
> 
> We do this regularly for graphics assignment, Nvidia, AMD, and now
> Intel.  It generally works ok.  Perhaps you've seen issues with the
> option ROM being run multiple times without resetting the device.  I
> could certainly believe that.  We only have one blacklisted Broadcom
> ROM in vfio, probably due to missing or incomplete device reset method.
>  
> > > > > > [...]      
> > > > > > > The write to 0x5C is used by QEMU code that traps writes to the
> > > > > > > device I/O port BAR and replaces the host stolen memory address
> > > > > > > with the new guest address generated here.  0x5C is initialized to
> > > > > > > 0x0 by kernel vfio code, so we can detect whether it has been
> > > > > > > written.  If not written, QEMU has no place to redirect to for
> > > > > > > stolen memory and it will either overlap VM memory or an unassigned
> > > > > > > area.  The former may corrupt VM memory, the latter throws host
> > > > > > > errors.  We could in QEMU halt with a hardware error if 0x5C hasn't
> > > > > > > been programmed.        
> > > > > > 
> > > > > > So, if I understand correctly, 0x5C is not a "real" register on the
> > > > > > hardware, but is instead just a mechanism to give QEMU the address of
> > > > > > some guest visible ram?      
> > > > > 
> > > > > It is a real register, BDSM that is virtualized by vfio turning it
> > > > > effectively into a scratch register.  On physical hardware the
> > > > > register is read-only.
> > > > >        
> > > > > > BTW, is 0xFC a "real" register in the hardware?  How does the guest
> > > > > > find the location of the "OpRegion" if SeaBIOS allocates it?      
> > > > > 
> > > > > 0xFC is the ASL Storage register, the guest finds the location of the
> > > > > OpRegion using this register.  This is another register that is
> > > > > read-only on real hardware but virtualized through vfio so we can
> > > > > relocate the OpRegion into the VM address space.
> > > > > 
> > > > > I've found that allocating a dummy MMIO BAR does work as an alternative
> > > > > for mapping space for this stolen memory into the VM address space.
> > > > > For a Linux guest it works to allocate BAR5 on the IGD device.
> > > > > Windows10 is not so happy with this, but does work if I allocate the
> > > > > BAR on something like the ISA bridge device.  My guess is that the IGD
> > > > > driver in Windows freaks out at finding this strange new BAR on its
> > > > > device.  So I'll need to come up with an algorithm for either creating
> > > > > a dummy PCI device to host this BAR or trying to add it to other
> > > > > existing devices.  It's certainly a more self-contained solution this
> > > > > way, so I expect we'll only need patch 1/3 from this series.  Thanks,      
> > > > 
> > > > Okay.  (I'm not saying patch 3 is bad, but okay.)
> > > > 
> > > > If you go through the trouble of mapping the BDSM through a pci bar,
> > > > then why not do the same with ASLS then too?    
> > > 
> > > I suppose we could do that.  There are a few nuances to the fake BAR
> > > solution:
> > > 
> > > 1) The BAR needs to get mapped and not remapped while in use - usually
> > >    not a problem.
> > > 
> > > 2) The guest needs to not disable the device we attach the BARs to,
> > >    which it might do if it doesn't recognize the device.
> > > 
> > > 3) We need to be careful about adding BARs to devices the guest does
> > >    have drivers for or we might overlap real functionality.
> > > 
> > > 4) If we create a dummy device with bogus IDs, it will show up with an
> > >    exclamation mark in device manager, which makes people unhappy.
> > > 
> > > So from a perspective of being self contained, the fake BAR solution is
> > > very good, but it's not without issue.  I'll try to think of what sort
> > > of dummy device we could create that would always have a guest driver,
> > > but nothing that a couple extra BARs would interfere with.  Maybe a
> > > generic PCI bridge.  Thanks,    
> > 
> > Okay.  Again, I'm not stating a preferred direction.
> > 
> > BTW, I wonder if the recent discussion between Michael and Igor is
> > relevant here:
> >   https://lists.gnu.org/archive/html/qemu-devel/2016-01/msg05602.html  
> 
> I'm certainly open to rebuttals against this approach, but I do have it
> working.  Being entirely self contained is pretty intriguing.
> Theoretically this would allow us to work with OVMF with no
> modifications.  Linux guests enable and disable devices several times
> during boot (per the spec, any time the BAR is sized it should be
> disabled), Windows never seems to disable the device.  The LPC/ISA
> bridge seems to be the best place to put these BARs, we need to create
> that anyway for pre-Broadwell/Skylake and the device itself has no
> implemented BARs. The ISA bridge is just a shell device to keep the
> driver happy on those older chips, so squatting on a couple BARs
> doesn't seem too terrible. Thanks,
I'm not sure what way Windows would behave wrt rebalancing
if that fake bar is added to LPC/ISA bridge and I can think of
a way to test and verify it either.

In above discussion Michael dislikes that allocating GPA in QEMU,
while I dislike stealing guest's RAM for just getting the same GPA.
Perhaps it could better if we teach SeaBIOS/OVMF to allocate GPA
without stealing guest's RAM somewhere in free address space and
tell QEMU about it.
Michael suggested to do something similar for stolen RAM in
bios_linker interface only, which is a part of SeaBIOS/OVMF
but maybe making it more generic PV interface might be even
more useful. That way we won't have to implement fake PCI
devices/BARs nor hijack existing PCI devices.
Gerd Hoffmann Feb. 15, 2016, 10:31 a.m. UTC | #10
Hi,

> The issue is in the VM and it occurs every time the option ROM is
> executed.  Standard VGA text mode displays fine (ex. SeaBIOS version
> string and boot menu), but any sort of extended graphics mode (ex. live
> CD boot menu) tries to make use of the host memory area which
> corresponds to the stolen memory area of the physical device.

I'm wondering whenever we can just use seavgabios (and support standard
vga modes only).

cheers,
  Gerd
Kevin O'Connor Feb. 15, 2016, 2:15 p.m. UTC | #11
On Sat, Feb 13, 2016 at 06:03:31PM -0700, Alex Williamson wrote:
> On Sat, 13 Feb 2016 19:20:32 -0500
> "Kevin O'Connor" <kevin@koconnor.net> wrote:
> > This confuses me - why didn't the host system BIOS turn on the LCD
> > panel during host bootup?
> 
> It turns off when we reset the device between VM instances or between
> VM boots.  IGD supports Function Level Reset (FLR).
> 
> > >Another desktop IvyBridge system
> > > doesn't really care about the vBIOS so long as we don't ask it to
> > > output anything before the guest native drivers are loaded.  If we
> > > could, I think we'd just enable vBIOS for laptop panel support, but
> > > that's really not an option, it's going to run as a boot option ROM as
> > > well, so we need to fix the issues that it generates there.  
> > 
> > From my experience with coreboot, running the vga option rom multiple
> > times during a given boot is very fragile.  (By multiple times, I mean
> > either the host running it and then a guest, or running it multiple
> > times from multiple guests.)  YMMV.
> 
> We do this regularly for graphics assignment, Nvidia, AMD, and now
> Intel.  It generally works ok.  Perhaps you've seen issues with the
> option ROM being run multiple times without resetting the device.  I
> could certainly believe that.

Interesting.  Using the FLR could be useful on some coreboot systems
where users want to run the option rom twice.  Thanks.

-Kevin
Alex Williamson Feb. 15, 2016, 5:46 p.m. UTC | #12
On Mon, 15 Feb 2016 11:31:41 +0100
Gerd Hoffmann <kraxel@redhat.com> wrote:

>   Hi,
> 
> > The issue is in the VM and it occurs every time the option ROM is
> > executed.  Standard VGA text mode displays fine (ex. SeaBIOS version
> > string and boot menu), but any sort of extended graphics mode (ex. live
> > CD boot menu) tries to make use of the host memory area which
> > corresponds to the stolen memory area of the physical device.  
> 
> I'm wondering whenever we can just use seavgabios (and support standard
> vga modes only).

The reason we need the vBIOS is to turn on the LCD panel, which in turn
has this dependency on the stolen memory area.  I'd be pretty surprised
if a generic VGA BIOS affected the panel in any way.  Maybe seavgabios
could be used for desktop IGD assignment, though I don't really know
how to detect an attached panel.  Thanks,

Alex
Alex Williamson Feb. 15, 2016, 7:20 p.m. UTC | #13
On Mon, 15 Feb 2016 10:54:51 +0100
Igor Mammedov <imammedo@redhat.com> wrote:

> On Sat, 13 Feb 2016 18:03:31 -0700
> Alex Williamson <alex.williamson@redhat.com> wrote:
> 
> > On Sat, 13 Feb 2016 19:20:32 -0500
> > "Kevin O'Connor" <kevin@koconnor.net> wrote:
> >   
> > > On Sat, Feb 13, 2016 at 01:57:09PM -0700, Alex Williamson wrote:    
> > > > On Sat, 13 Feb 2016 15:05:09 -0500
> > > > "Kevin O'Connor" <kevin@koconnor.net> wrote:      
> > > > > On Sat, Feb 13, 2016 at 11:51:51AM -0700, Alex Williamson wrote:      
> > > > > > On Sat, 13 Feb 2016 13:18:39 -0500
> > > > > > "Kevin O'Connor" <kevin@koconnor.net> wrote:        
> > > > > > > On Sat, Feb 13, 2016 at 08:12:09AM -0700, Alex Williamson wrote:        
> > > > > > > > On Fri, 12 Feb 2016 21:49:04 -0500
> > > > > > > > "Kevin O'Connor" <kevin@koconnor.net> wrote:          
> > > > > > > > > On Fri, Feb 12, 2016 at 05:23:18PM -0700, Alex Williamson wrote:          
> > > > > > > > > > Intel IGD makes use of memory allocated and marked reserved by the
> > > > > > > > > > BIOS as a stolen memory range.  For the most part, guest drivers don't
> > > > > > > > > > make use of this, but our achilles heel is the vBIOS.  The vBIOS
> > > > > > > > > > programs the device to use the host stolen memory range and it's used
> > > > > > > > > > in the pre-boot environment.  Generally the guest won't have access to
> > > > > > > > > > the host stolen memory area, so these accesses either land in VM
> > > > > > > > > > memory or unassigned space and generate IOMMU faults.  By allocating
> > > > > > > > > > this range in SeaBIOS and programming it into the device, QEMU (via
> > > > > > > > > > vfio) can make sure this guest allocated stolen memory range is used
> > > > > > > > > > instead.            
> > > > > > > > > 
> > > > > > > > > What does "vBIOS" mean in this context?  Is it the video device option
> > > > > > > > > rom or something else?          
> > > > > > > > 
> > > > > > > > vBIOS = video BIOS, you're correct, it's just the video device option
> > > > > > > > ROM.          
> > > > > > > 
> > > > > > > Is the problem from when the host runs the video option rom, or is the
> > > > > > > problem from the guest (via SeaBIOS) running the video option rom?  If
> > > > > > > the guest is running the option rom, is it the first time the option
> > > > > > > rom has been run for the machine (ie, was the option rom not executed
> > > > > > > on the host when the host machine first booted)?
> > > > > > > 
> > > > > > > FWIW, many of the chromebooks use coreboot with Intel graphics and a
> > > > > > > number of users have installed SeaBIOS (running natively) on their
> > > > > > > machines.  Running the intel video option rom more than once has been
> > > > > > > known to cause issues.        
> > > > > > 
> > > > > > The issue is in the VM and it occurs every time the option ROM is
> > > > > > executed.  Standard VGA text mode displays fine (ex. SeaBIOS version
> > > > > > string and boot menu), but any sort of extended graphics mode (ex. live
> > > > > > CD boot menu) tries to make use of the host memory area which
> > > > > > corresponds to the stolen memory area of the physical device.  We're
> > > > > > not really sure how the ROM execution arrives at these addresses (not
> > > > > > from the device according to access traces), but we can see when the
> > > > > > ROM is writing these addresses to the device and modify they addresses
> > > > > > to point at a VM memory range which we've allocated.  That's what this
> > > > > > code attempts to do, allocate a buffer and tell QEMU about it via the
> > > > > > BDSM (Base Data of Stolen Memory) register.        
> > > > > 
> > > > > Forgive me if I'm not fully understanding this.  If I read what you're
> > > > > saying then the sequence is something like:
> > > > > 
> > > > > 1 - the host system bios (or vgabios) programs the GTT/stolen memory
> > > > >     base register at host system bootup time and reserves it in the
> > > > >     host e820 map.
> > > > > 
> > > > > 2 - upon running qemu, the guest reruns the vga bios option rom which
> > > > >     seems to work (ie, text mode works)
> > > > > 
> > > > > 3 - in the guest, upon running a bootloader that uses graphics mode,
> > > > >     the bootloader calls the vgabios to switch to graphics mode, and
> > > > >     the vgabios sends commands to the graphics hardware that somehow
> > > > >     reference the host stolen memory      
> > > > 
> > > > What exactly happens here isn't clear to me, but this is a plausible
> > > > explanation.  What we see in tracing access to the hardware is that a
> > > > bunch of addresses are written to the device that fall within the host
> > > > e820 reserved area and then the device starts generating IOMMU faults
> > > > trying to access those addresses.
> > > >       
> > > > > 4 - your patch causes QEMU to catch these commands with references to
> > > > >     the host stolen memory and replace them with references to the
> > > > >     guest stolen memory (which seabios creates)
> > > > > 
> > > > > Am I understanding the above correctly?      
> > > > 
> > > > Yes.
> > > >        
> > > > > Is the only reason to run the intel option rom in the guest for
> > > > > bootloader graphic mode support?  Last time I looked, the intel vga
> > > > > hardware could fully emulate a legacy vga device - if the device is in
> > > > > vga compatibility mode then it may be possible to have seavgabios
> > > > > drive mode changes.      
> > > > 
> > > > I have a SandyBridge based laptop (Lenovo W520) where the LCD panel
> > > > won't turn on without the vBIOS.      
> > > 
> > > This confuses me - why didn't the host system BIOS turn on the LCD
> > > panel during host bootup?    
> > 
> > It turns off when we reset the device between VM instances or between
> > VM boots.  IGD supports Function Level Reset (FLR).
> >   
> > > >Another desktop IvyBridge system
> > > > doesn't really care about the vBIOS so long as we don't ask it to
> > > > output anything before the guest native drivers are loaded.  If we
> > > > could, I think we'd just enable vBIOS for laptop panel support, but
> > > > that's really not an option, it's going to run as a boot option ROM as
> > > > well, so we need to fix the issues that it generates there.      
> > > 
> > > From my experience with coreboot, running the vga option rom multiple
> > > times during a given boot is very fragile.  (By multiple times, I mean
> > > either the host running it and then a guest, or running it multiple
> > > times from multiple guests.)  YMMV.    
> > 
> > We do this regularly for graphics assignment, Nvidia, AMD, and now
> > Intel.  It generally works ok.  Perhaps you've seen issues with the
> > option ROM being run multiple times without resetting the device.  I
> > could certainly believe that.  We only have one blacklisted Broadcom
> > ROM in vfio, probably due to missing or incomplete device reset method.
> >    
> > > > > > > [...]        
> > > > > > > > The write to 0x5C is used by QEMU code that traps writes to the
> > > > > > > > device I/O port BAR and replaces the host stolen memory address
> > > > > > > > with the new guest address generated here.  0x5C is initialized to
> > > > > > > > 0x0 by kernel vfio code, so we can detect whether it has been
> > > > > > > > written.  If not written, QEMU has no place to redirect to for
> > > > > > > > stolen memory and it will either overlap VM memory or an unassigned
> > > > > > > > area.  The former may corrupt VM memory, the latter throws host
> > > > > > > > errors.  We could in QEMU halt with a hardware error if 0x5C hasn't
> > > > > > > > been programmed.          
> > > > > > > 
> > > > > > > So, if I understand correctly, 0x5C is not a "real" register on the
> > > > > > > hardware, but is instead just a mechanism to give QEMU the address of
> > > > > > > some guest visible ram?        
> > > > > > 
> > > > > > It is a real register, BDSM that is virtualized by vfio turning it
> > > > > > effectively into a scratch register.  On physical hardware the
> > > > > > register is read-only.
> > > > > >          
> > > > > > > BTW, is 0xFC a "real" register in the hardware?  How does the guest
> > > > > > > find the location of the "OpRegion" if SeaBIOS allocates it?        
> > > > > > 
> > > > > > 0xFC is the ASL Storage register, the guest finds the location of the
> > > > > > OpRegion using this register.  This is another register that is
> > > > > > read-only on real hardware but virtualized through vfio so we can
> > > > > > relocate the OpRegion into the VM address space.
> > > > > > 
> > > > > > I've found that allocating a dummy MMIO BAR does work as an alternative
> > > > > > for mapping space for this stolen memory into the VM address space.
> > > > > > For a Linux guest it works to allocate BAR5 on the IGD device.
> > > > > > Windows10 is not so happy with this, but does work if I allocate the
> > > > > > BAR on something like the ISA bridge device.  My guess is that the IGD
> > > > > > driver in Windows freaks out at finding this strange new BAR on its
> > > > > > device.  So I'll need to come up with an algorithm for either creating
> > > > > > a dummy PCI device to host this BAR or trying to add it to other
> > > > > > existing devices.  It's certainly a more self-contained solution this
> > > > > > way, so I expect we'll only need patch 1/3 from this series.  Thanks,        
> > > > > 
> > > > > Okay.  (I'm not saying patch 3 is bad, but okay.)
> > > > > 
> > > > > If you go through the trouble of mapping the BDSM through a pci bar,
> > > > > then why not do the same with ASLS then too?      
> > > > 
> > > > I suppose we could do that.  There are a few nuances to the fake BAR
> > > > solution:
> > > > 
> > > > 1) The BAR needs to get mapped and not remapped while in use - usually
> > > >    not a problem.
> > > > 
> > > > 2) The guest needs to not disable the device we attach the BARs to,
> > > >    which it might do if it doesn't recognize the device.
> > > > 
> > > > 3) We need to be careful about adding BARs to devices the guest does
> > > >    have drivers for or we might overlap real functionality.
> > > > 
> > > > 4) If we create a dummy device with bogus IDs, it will show up with an
> > > >    exclamation mark in device manager, which makes people unhappy.
> > > > 
> > > > So from a perspective of being self contained, the fake BAR solution is
> > > > very good, but it's not without issue.  I'll try to think of what sort
> > > > of dummy device we could create that would always have a guest driver,
> > > > but nothing that a couple extra BARs would interfere with.  Maybe a
> > > > generic PCI bridge.  Thanks,      
> > > 
> > > Okay.  Again, I'm not stating a preferred direction.
> > > 
> > > BTW, I wonder if the recent discussion between Michael and Igor is
> > > relevant here:
> > >   https://lists.gnu.org/archive/html/qemu-devel/2016-01/msg05602.html    
> > 
> > I'm certainly open to rebuttals against this approach, but I do have it
> > working.  Being entirely self contained is pretty intriguing.
> > Theoretically this would allow us to work with OVMF with no
> > modifications.  Linux guests enable and disable devices several times
> > during boot (per the spec, any time the BAR is sized it should be
> > disabled), Windows never seems to disable the device.  The LPC/ISA
> > bridge seems to be the best place to put these BARs, we need to create
> > that anyway for pre-Broadwell/Skylake and the device itself has no
> > implemented BARs. The ISA bridge is just a shell device to keep the
> > driver happy on those older chips, so squatting on a couple BARs
> > doesn't seem too terrible. Thanks,  
> I'm not sure what way Windows would behave wrt rebalancing
> if that fake bar is added to LPC/ISA bridge and I can think of
> a way to test and verify it either.
> 
> In above discussion Michael dislikes that allocating GPA in QEMU,
> while I dislike stealing guest's RAM for just getting the same GPA.
> Perhaps it could better if we teach SeaBIOS/OVMF to allocate GPA
> without stealing guest's RAM somewhere in free address space and
> tell QEMU about it.
> Michael suggested to do something similar for stolen RAM in
> bios_linker interface only, which is a part of SeaBIOS/OVMF
> but maybe making it more generic PV interface might be even
> more useful. That way we won't have to implement fake PCI
> devices/BARs nor hijack existing PCI devices.

Well, I was hoping that your requirements that the address cannot move
or be disabled was more strict that mine, but I think I've just proven
it's not.  I prevented the native drivers from loading in the guest
(pci-stubs.ids=8086:0126), reserved the memory that SeaBIOS was
programming for the ISA bridge BARs (memmap=2M#0xf1000000), told PCI to
reassign devices (pci=realloc), and booted with a VESA mode (vga=ask).
If I get all those things to align then VESA mode makes use of the
stolen memory assigned at vBIOS init and when the guest moves it we get
garbage on the screen and a spew of IOMMU faults in the host.  Even if
I don't do the reserved/realloc thing, I get a single IOMMU fault in
the host, which I assume is an access that occurs while the BAR is
being sized and therefore disabled.  That would be a more realistic
scenario of an install program not having native support for the
GPU.  Darn, this PCI hack was growing on me.

I don't mind the allocation of guest memory issue, the fact that guest
memory is consumed by built-in devices is exactly what happens on bare
metal.  The only part I don't like about the BIOS assigning the GPA is
that we need support in each BIOS to do that.  I think we had already
come up with a pretty good way to handle the opregion using
etc/igd-opregion to fill the space, for stolen memory, I think we're
just going to need to do something similar, maybe even via a dummy
fw_cfg file so that SeaBIOS not only knows the size, but if we needed
to pre-populate it, we could.  It at least puts QEMU in control, but
relaying back the address via a device specific register is still a bit
ugly.  Thanks,

Alex
Michael S. Tsirkin Feb. 15, 2016, 7:29 p.m. UTC | #14
On Mon, Feb 15, 2016 at 12:20:23PM -0700, Alex Williamson wrote:
> On Mon, 15 Feb 2016 10:54:51 +0100
> Igor Mammedov <imammedo@redhat.com> wrote:
> 
> > On Sat, 13 Feb 2016 18:03:31 -0700
> > Alex Williamson <alex.williamson@redhat.com> wrote:
> > 
> > > On Sat, 13 Feb 2016 19:20:32 -0500
> > > "Kevin O'Connor" <kevin@koconnor.net> wrote:
> > >   
> > > > On Sat, Feb 13, 2016 at 01:57:09PM -0700, Alex Williamson wrote:    
> > > > > On Sat, 13 Feb 2016 15:05:09 -0500
> > > > > "Kevin O'Connor" <kevin@koconnor.net> wrote:      
> > > > > > On Sat, Feb 13, 2016 at 11:51:51AM -0700, Alex Williamson wrote:      
> > > > > > > On Sat, 13 Feb 2016 13:18:39 -0500
> > > > > > > "Kevin O'Connor" <kevin@koconnor.net> wrote:        
> > > > > > > > On Sat, Feb 13, 2016 at 08:12:09AM -0700, Alex Williamson wrote:        
> > > > > > > > > On Fri, 12 Feb 2016 21:49:04 -0500
> > > > > > > > > "Kevin O'Connor" <kevin@koconnor.net> wrote:          
> > > > > > > > > > On Fri, Feb 12, 2016 at 05:23:18PM -0700, Alex Williamson wrote:          
> > > > > > > > > > > Intel IGD makes use of memory allocated and marked reserved by the
> > > > > > > > > > > BIOS as a stolen memory range.  For the most part, guest drivers don't
> > > > > > > > > > > make use of this, but our achilles heel is the vBIOS.  The vBIOS
> > > > > > > > > > > programs the device to use the host stolen memory range and it's used
> > > > > > > > > > > in the pre-boot environment.  Generally the guest won't have access to
> > > > > > > > > > > the host stolen memory area, so these accesses either land in VM
> > > > > > > > > > > memory or unassigned space and generate IOMMU faults.  By allocating
> > > > > > > > > > > this range in SeaBIOS and programming it into the device, QEMU (via
> > > > > > > > > > > vfio) can make sure this guest allocated stolen memory range is used
> > > > > > > > > > > instead.            
> > > > > > > > > > 
> > > > > > > > > > What does "vBIOS" mean in this context?  Is it the video device option
> > > > > > > > > > rom or something else?          
> > > > > > > > > 
> > > > > > > > > vBIOS = video BIOS, you're correct, it's just the video device option
> > > > > > > > > ROM.          
> > > > > > > > 
> > > > > > > > Is the problem from when the host runs the video option rom, or is the
> > > > > > > > problem from the guest (via SeaBIOS) running the video option rom?  If
> > > > > > > > the guest is running the option rom, is it the first time the option
> > > > > > > > rom has been run for the machine (ie, was the option rom not executed
> > > > > > > > on the host when the host machine first booted)?
> > > > > > > > 
> > > > > > > > FWIW, many of the chromebooks use coreboot with Intel graphics and a
> > > > > > > > number of users have installed SeaBIOS (running natively) on their
> > > > > > > > machines.  Running the intel video option rom more than once has been
> > > > > > > > known to cause issues.        
> > > > > > > 
> > > > > > > The issue is in the VM and it occurs every time the option ROM is
> > > > > > > executed.  Standard VGA text mode displays fine (ex. SeaBIOS version
> > > > > > > string and boot menu), but any sort of extended graphics mode (ex. live
> > > > > > > CD boot menu) tries to make use of the host memory area which
> > > > > > > corresponds to the stolen memory area of the physical device.  We're
> > > > > > > not really sure how the ROM execution arrives at these addresses (not
> > > > > > > from the device according to access traces), but we can see when the
> > > > > > > ROM is writing these addresses to the device and modify they addresses
> > > > > > > to point at a VM memory range which we've allocated.  That's what this
> > > > > > > code attempts to do, allocate a buffer and tell QEMU about it via the
> > > > > > > BDSM (Base Data of Stolen Memory) register.        
> > > > > > 
> > > > > > Forgive me if I'm not fully understanding this.  If I read what you're
> > > > > > saying then the sequence is something like:
> > > > > > 
> > > > > > 1 - the host system bios (or vgabios) programs the GTT/stolen memory
> > > > > >     base register at host system bootup time and reserves it in the
> > > > > >     host e820 map.
> > > > > > 
> > > > > > 2 - upon running qemu, the guest reruns the vga bios option rom which
> > > > > >     seems to work (ie, text mode works)
> > > > > > 
> > > > > > 3 - in the guest, upon running a bootloader that uses graphics mode,
> > > > > >     the bootloader calls the vgabios to switch to graphics mode, and
> > > > > >     the vgabios sends commands to the graphics hardware that somehow
> > > > > >     reference the host stolen memory      
> > > > > 
> > > > > What exactly happens here isn't clear to me, but this is a plausible
> > > > > explanation.  What we see in tracing access to the hardware is that a
> > > > > bunch of addresses are written to the device that fall within the host
> > > > > e820 reserved area and then the device starts generating IOMMU faults
> > > > > trying to access those addresses.
> > > > >       
> > > > > > 4 - your patch causes QEMU to catch these commands with references to
> > > > > >     the host stolen memory and replace them with references to the
> > > > > >     guest stolen memory (which seabios creates)
> > > > > > 
> > > > > > Am I understanding the above correctly?      
> > > > > 
> > > > > Yes.
> > > > >        
> > > > > > Is the only reason to run the intel option rom in the guest for
> > > > > > bootloader graphic mode support?  Last time I looked, the intel vga
> > > > > > hardware could fully emulate a legacy vga device - if the device is in
> > > > > > vga compatibility mode then it may be possible to have seavgabios
> > > > > > drive mode changes.      
> > > > > 
> > > > > I have a SandyBridge based laptop (Lenovo W520) where the LCD panel
> > > > > won't turn on without the vBIOS.      
> > > > 
> > > > This confuses me - why didn't the host system BIOS turn on the LCD
> > > > panel during host bootup?    
> > > 
> > > It turns off when we reset the device between VM instances or between
> > > VM boots.  IGD supports Function Level Reset (FLR).
> > >   
> > > > >Another desktop IvyBridge system
> > > > > doesn't really care about the vBIOS so long as we don't ask it to
> > > > > output anything before the guest native drivers are loaded.  If we
> > > > > could, I think we'd just enable vBIOS for laptop panel support, but
> > > > > that's really not an option, it's going to run as a boot option ROM as
> > > > > well, so we need to fix the issues that it generates there.      
> > > > 
> > > > From my experience with coreboot, running the vga option rom multiple
> > > > times during a given boot is very fragile.  (By multiple times, I mean
> > > > either the host running it and then a guest, or running it multiple
> > > > times from multiple guests.)  YMMV.    
> > > 
> > > We do this regularly for graphics assignment, Nvidia, AMD, and now
> > > Intel.  It generally works ok.  Perhaps you've seen issues with the
> > > option ROM being run multiple times without resetting the device.  I
> > > could certainly believe that.  We only have one blacklisted Broadcom
> > > ROM in vfio, probably due to missing or incomplete device reset method.
> > >    
> > > > > > > > [...]        
> > > > > > > > > The write to 0x5C is used by QEMU code that traps writes to the
> > > > > > > > > device I/O port BAR and replaces the host stolen memory address
> > > > > > > > > with the new guest address generated here.  0x5C is initialized to
> > > > > > > > > 0x0 by kernel vfio code, so we can detect whether it has been
> > > > > > > > > written.  If not written, QEMU has no place to redirect to for
> > > > > > > > > stolen memory and it will either overlap VM memory or an unassigned
> > > > > > > > > area.  The former may corrupt VM memory, the latter throws host
> > > > > > > > > errors.  We could in QEMU halt with a hardware error if 0x5C hasn't
> > > > > > > > > been programmed.          
> > > > > > > > 
> > > > > > > > So, if I understand correctly, 0x5C is not a "real" register on the
> > > > > > > > hardware, but is instead just a mechanism to give QEMU the address of
> > > > > > > > some guest visible ram?        
> > > > > > > 
> > > > > > > It is a real register, BDSM that is virtualized by vfio turning it
> > > > > > > effectively into a scratch register.  On physical hardware the
> > > > > > > register is read-only.
> > > > > > >          
> > > > > > > > BTW, is 0xFC a "real" register in the hardware?  How does the guest
> > > > > > > > find the location of the "OpRegion" if SeaBIOS allocates it?        
> > > > > > > 
> > > > > > > 0xFC is the ASL Storage register, the guest finds the location of the
> > > > > > > OpRegion using this register.  This is another register that is
> > > > > > > read-only on real hardware but virtualized through vfio so we can
> > > > > > > relocate the OpRegion into the VM address space.
> > > > > > > 
> > > > > > > I've found that allocating a dummy MMIO BAR does work as an alternative
> > > > > > > for mapping space for this stolen memory into the VM address space.
> > > > > > > For a Linux guest it works to allocate BAR5 on the IGD device.
> > > > > > > Windows10 is not so happy with this, but does work if I allocate the
> > > > > > > BAR on something like the ISA bridge device.  My guess is that the IGD
> > > > > > > driver in Windows freaks out at finding this strange new BAR on its
> > > > > > > device.  So I'll need to come up with an algorithm for either creating
> > > > > > > a dummy PCI device to host this BAR or trying to add it to other
> > > > > > > existing devices.  It's certainly a more self-contained solution this
> > > > > > > way, so I expect we'll only need patch 1/3 from this series.  Thanks,        
> > > > > > 
> > > > > > Okay.  (I'm not saying patch 3 is bad, but okay.)
> > > > > > 
> > > > > > If you go through the trouble of mapping the BDSM through a pci bar,
> > > > > > then why not do the same with ASLS then too?      
> > > > > 
> > > > > I suppose we could do that.  There are a few nuances to the fake BAR
> > > > > solution:
> > > > > 
> > > > > 1) The BAR needs to get mapped and not remapped while in use - usually
> > > > >    not a problem.
> > > > > 
> > > > > 2) The guest needs to not disable the device we attach the BARs to,
> > > > >    which it might do if it doesn't recognize the device.
> > > > > 
> > > > > 3) We need to be careful about adding BARs to devices the guest does
> > > > >    have drivers for or we might overlap real functionality.
> > > > > 
> > > > > 4) If we create a dummy device with bogus IDs, it will show up with an
> > > > >    exclamation mark in device manager, which makes people unhappy.
> > > > > 
> > > > > So from a perspective of being self contained, the fake BAR solution is
> > > > > very good, but it's not without issue.  I'll try to think of what sort
> > > > > of dummy device we could create that would always have a guest driver,
> > > > > but nothing that a couple extra BARs would interfere with.  Maybe a
> > > > > generic PCI bridge.  Thanks,      
> > > > 
> > > > Okay.  Again, I'm not stating a preferred direction.
> > > > 
> > > > BTW, I wonder if the recent discussion between Michael and Igor is
> > > > relevant here:
> > > >   https://lists.gnu.org/archive/html/qemu-devel/2016-01/msg05602.html    
> > > 
> > > I'm certainly open to rebuttals against this approach, but I do have it
> > > working.  Being entirely self contained is pretty intriguing.
> > > Theoretically this would allow us to work with OVMF with no
> > > modifications.  Linux guests enable and disable devices several times
> > > during boot (per the spec, any time the BAR is sized it should be
> > > disabled), Windows never seems to disable the device.  The LPC/ISA
> > > bridge seems to be the best place to put these BARs, we need to create
> > > that anyway for pre-Broadwell/Skylake and the device itself has no
> > > implemented BARs. The ISA bridge is just a shell device to keep the
> > > driver happy on those older chips, so squatting on a couple BARs
> > > doesn't seem too terrible. Thanks,  
> > I'm not sure what way Windows would behave wrt rebalancing
> > if that fake bar is added to LPC/ISA bridge and I can think of
> > a way to test and verify it either.
> > 
> > In above discussion Michael dislikes that allocating GPA in QEMU,
> > while I dislike stealing guest's RAM for just getting the same GPA.
> > Perhaps it could better if we teach SeaBIOS/OVMF to allocate GPA
> > without stealing guest's RAM somewhere in free address space and
> > tell QEMU about it.
> > Michael suggested to do something similar for stolen RAM in
> > bios_linker interface only, which is a part of SeaBIOS/OVMF
> > but maybe making it more generic PV interface might be even
> > more useful. That way we won't have to implement fake PCI
> > devices/BARs nor hijack existing PCI devices.
> 
> Well, I was hoping that your requirements that the address cannot move
> or be disabled was more strict that mine, but I think I've just proven
> it's not.  I prevented the native drivers from loading in the guest
> (pci-stubs.ids=8086:0126), reserved the memory that SeaBIOS was
> programming for the ISA bridge BARs (memmap=2M#0xf1000000), told PCI to
> reassign devices (pci=realloc), and booted with a VESA mode (vga=ask).
> If I get all those things to align then VESA mode makes use of the
> stolen memory assigned at vBIOS init and when the guest moves it we get
> garbage on the screen and a spew of IOMMU faults in the host.  Even if
> I don't do the reserved/realloc thing, I get a single IOMMU fault in
> the host, which I assume is an access that occurs while the BAR is
> being sized and therefore disabled.  That would be a more realistic
> scenario of an install program not having native support for the
> GPU.  Darn, this PCI hack was growing on me.
> 
> I don't mind the allocation of guest memory issue, the fact that guest
> memory is consumed by built-in devices is exactly what happens on bare
> metal.

Right, I keep saying that.

> The only part I don't like about the BIOS assigning the GPA is
> that we need support in each BIOS to do that.

But then if we assign it on QEMU do we still need
each BIOS to read it?

>  I think we had already
> come up with a pretty good way to handle the opregion using
> etc/igd-opregion to fill the space, for stolen memory, I think we're
> just going to need to do something similar, maybe even via a dummy
> fw_cfg file so that SeaBIOS not only knows the size, but if we needed
> to pre-populate it, we could.  It at least puts QEMU in control, but
> relaying back the address via a device specific register is still a bit
> ugly.  Thanks,
> 
> Alex

I can build a generic interface to pass addresses
allocated by bios back to QEMU. It looks like this would
be useful for other purposes as well.  Interested?
Kevin O'Connor Feb. 15, 2016, 7:50 p.m. UTC | #15
On Mon, Feb 15, 2016 at 09:29:26PM +0200, Michael S. Tsirkin wrote:
> I can build a generic interface to pass addresses
> allocated by bios back to QEMU. It looks like this would
> be useful for other purposes as well.  Interested?

If this is undertaken, I suggest extending fw_cfg to support "writable
files" instead of introducing a whole new interface.

-Kevin
Alex Williamson Feb. 15, 2016, 7:54 p.m. UTC | #16
On Mon, 15 Feb 2016 21:29:26 +0200
"Michael S. Tsirkin" <mst@redhat.com> wrote:

> On Mon, Feb 15, 2016 at 12:20:23PM -0700, Alex Williamson wrote:
> > On Mon, 15 Feb 2016 10:54:51 +0100
> > Igor Mammedov <imammedo@redhat.com> wrote:
> >   
> > > On Sat, 13 Feb 2016 18:03:31 -0700
> > > Alex Williamson <alex.williamson@redhat.com> wrote:
> > >   
> > > > On Sat, 13 Feb 2016 19:20:32 -0500
> > > > "Kevin O'Connor" <kevin@koconnor.net> wrote:
> > > >     
> > > > > On Sat, Feb 13, 2016 at 01:57:09PM -0700, Alex Williamson wrote:      
> > > > > > On Sat, 13 Feb 2016 15:05:09 -0500
> > > > > > "Kevin O'Connor" <kevin@koconnor.net> wrote:        
> > > > > > > On Sat, Feb 13, 2016 at 11:51:51AM -0700, Alex Williamson wrote:        
> > > > > > > > On Sat, 13 Feb 2016 13:18:39 -0500
> > > > > > > > "Kevin O'Connor" <kevin@koconnor.net> wrote:          
> > > > > > > > > On Sat, Feb 13, 2016 at 08:12:09AM -0700, Alex Williamson wrote:          
> > > > > > > > > > On Fri, 12 Feb 2016 21:49:04 -0500
> > > > > > > > > > "Kevin O'Connor" <kevin@koconnor.net> wrote:            
> > > > > > > > > > > On Fri, Feb 12, 2016 at 05:23:18PM -0700, Alex Williamson wrote:            
> > > > > > > > > > > > Intel IGD makes use of memory allocated and marked reserved by the
> > > > > > > > > > > > BIOS as a stolen memory range.  For the most part, guest drivers don't
> > > > > > > > > > > > make use of this, but our achilles heel is the vBIOS.  The vBIOS
> > > > > > > > > > > > programs the device to use the host stolen memory range and it's used
> > > > > > > > > > > > in the pre-boot environment.  Generally the guest won't have access to
> > > > > > > > > > > > the host stolen memory area, so these accesses either land in VM
> > > > > > > > > > > > memory or unassigned space and generate IOMMU faults.  By allocating
> > > > > > > > > > > > this range in SeaBIOS and programming it into the device, QEMU (via
> > > > > > > > > > > > vfio) can make sure this guest allocated stolen memory range is used
> > > > > > > > > > > > instead.              
> > > > > > > > > > > 
> > > > > > > > > > > What does "vBIOS" mean in this context?  Is it the video device option
> > > > > > > > > > > rom or something else?            
> > > > > > > > > > 
> > > > > > > > > > vBIOS = video BIOS, you're correct, it's just the video device option
> > > > > > > > > > ROM.            
> > > > > > > > > 
> > > > > > > > > Is the problem from when the host runs the video option rom, or is the
> > > > > > > > > problem from the guest (via SeaBIOS) running the video option rom?  If
> > > > > > > > > the guest is running the option rom, is it the first time the option
> > > > > > > > > rom has been run for the machine (ie, was the option rom not executed
> > > > > > > > > on the host when the host machine first booted)?
> > > > > > > > > 
> > > > > > > > > FWIW, many of the chromebooks use coreboot with Intel graphics and a
> > > > > > > > > number of users have installed SeaBIOS (running natively) on their
> > > > > > > > > machines.  Running the intel video option rom more than once has been
> > > > > > > > > known to cause issues.          
> > > > > > > > 
> > > > > > > > The issue is in the VM and it occurs every time the option ROM is
> > > > > > > > executed.  Standard VGA text mode displays fine (ex. SeaBIOS version
> > > > > > > > string and boot menu), but any sort of extended graphics mode (ex. live
> > > > > > > > CD boot menu) tries to make use of the host memory area which
> > > > > > > > corresponds to the stolen memory area of the physical device.  We're
> > > > > > > > not really sure how the ROM execution arrives at these addresses (not
> > > > > > > > from the device according to access traces), but we can see when the
> > > > > > > > ROM is writing these addresses to the device and modify they addresses
> > > > > > > > to point at a VM memory range which we've allocated.  That's what this
> > > > > > > > code attempts to do, allocate a buffer and tell QEMU about it via the
> > > > > > > > BDSM (Base Data of Stolen Memory) register.          
> > > > > > > 
> > > > > > > Forgive me if I'm not fully understanding this.  If I read what you're
> > > > > > > saying then the sequence is something like:
> > > > > > > 
> > > > > > > 1 - the host system bios (or vgabios) programs the GTT/stolen memory
> > > > > > >     base register at host system bootup time and reserves it in the
> > > > > > >     host e820 map.
> > > > > > > 
> > > > > > > 2 - upon running qemu, the guest reruns the vga bios option rom which
> > > > > > >     seems to work (ie, text mode works)
> > > > > > > 
> > > > > > > 3 - in the guest, upon running a bootloader that uses graphics mode,
> > > > > > >     the bootloader calls the vgabios to switch to graphics mode, and
> > > > > > >     the vgabios sends commands to the graphics hardware that somehow
> > > > > > >     reference the host stolen memory        
> > > > > > 
> > > > > > What exactly happens here isn't clear to me, but this is a plausible
> > > > > > explanation.  What we see in tracing access to the hardware is that a
> > > > > > bunch of addresses are written to the device that fall within the host
> > > > > > e820 reserved area and then the device starts generating IOMMU faults
> > > > > > trying to access those addresses.
> > > > > >         
> > > > > > > 4 - your patch causes QEMU to catch these commands with references to
> > > > > > >     the host stolen memory and replace them with references to the
> > > > > > >     guest stolen memory (which seabios creates)
> > > > > > > 
> > > > > > > Am I understanding the above correctly?        
> > > > > > 
> > > > > > Yes.
> > > > > >          
> > > > > > > Is the only reason to run the intel option rom in the guest for
> > > > > > > bootloader graphic mode support?  Last time I looked, the intel vga
> > > > > > > hardware could fully emulate a legacy vga device - if the device is in
> > > > > > > vga compatibility mode then it may be possible to have seavgabios
> > > > > > > drive mode changes.        
> > > > > > 
> > > > > > I have a SandyBridge based laptop (Lenovo W520) where the LCD panel
> > > > > > won't turn on without the vBIOS.        
> > > > > 
> > > > > This confuses me - why didn't the host system BIOS turn on the LCD
> > > > > panel during host bootup?      
> > > > 
> > > > It turns off when we reset the device between VM instances or between
> > > > VM boots.  IGD supports Function Level Reset (FLR).
> > > >     
> > > > > >Another desktop IvyBridge system
> > > > > > doesn't really care about the vBIOS so long as we don't ask it to
> > > > > > output anything before the guest native drivers are loaded.  If we
> > > > > > could, I think we'd just enable vBIOS for laptop panel support, but
> > > > > > that's really not an option, it's going to run as a boot option ROM as
> > > > > > well, so we need to fix the issues that it generates there.        
> > > > > 
> > > > > From my experience with coreboot, running the vga option rom multiple
> > > > > times during a given boot is very fragile.  (By multiple times, I mean
> > > > > either the host running it and then a guest, or running it multiple
> > > > > times from multiple guests.)  YMMV.      
> > > > 
> > > > We do this regularly for graphics assignment, Nvidia, AMD, and now
> > > > Intel.  It generally works ok.  Perhaps you've seen issues with the
> > > > option ROM being run multiple times without resetting the device.  I
> > > > could certainly believe that.  We only have one blacklisted Broadcom
> > > > ROM in vfio, probably due to missing or incomplete device reset method.
> > > >      
> > > > > > > > > [...]          
> > > > > > > > > > The write to 0x5C is used by QEMU code that traps writes to the
> > > > > > > > > > device I/O port BAR and replaces the host stolen memory address
> > > > > > > > > > with the new guest address generated here.  0x5C is initialized to
> > > > > > > > > > 0x0 by kernel vfio code, so we can detect whether it has been
> > > > > > > > > > written.  If not written, QEMU has no place to redirect to for
> > > > > > > > > > stolen memory and it will either overlap VM memory or an unassigned
> > > > > > > > > > area.  The former may corrupt VM memory, the latter throws host
> > > > > > > > > > errors.  We could in QEMU halt with a hardware error if 0x5C hasn't
> > > > > > > > > > been programmed.            
> > > > > > > > > 
> > > > > > > > > So, if I understand correctly, 0x5C is not a "real" register on the
> > > > > > > > > hardware, but is instead just a mechanism to give QEMU the address of
> > > > > > > > > some guest visible ram?          
> > > > > > > > 
> > > > > > > > It is a real register, BDSM that is virtualized by vfio turning it
> > > > > > > > effectively into a scratch register.  On physical hardware the
> > > > > > > > register is read-only.
> > > > > > > >            
> > > > > > > > > BTW, is 0xFC a "real" register in the hardware?  How does the guest
> > > > > > > > > find the location of the "OpRegion" if SeaBIOS allocates it?          
> > > > > > > > 
> > > > > > > > 0xFC is the ASL Storage register, the guest finds the location of the
> > > > > > > > OpRegion using this register.  This is another register that is
> > > > > > > > read-only on real hardware but virtualized through vfio so we can
> > > > > > > > relocate the OpRegion into the VM address space.
> > > > > > > > 
> > > > > > > > I've found that allocating a dummy MMIO BAR does work as an alternative
> > > > > > > > for mapping space for this stolen memory into the VM address space.
> > > > > > > > For a Linux guest it works to allocate BAR5 on the IGD device.
> > > > > > > > Windows10 is not so happy with this, but does work if I allocate the
> > > > > > > > BAR on something like the ISA bridge device.  My guess is that the IGD
> > > > > > > > driver in Windows freaks out at finding this strange new BAR on its
> > > > > > > > device.  So I'll need to come up with an algorithm for either creating
> > > > > > > > a dummy PCI device to host this BAR or trying to add it to other
> > > > > > > > existing devices.  It's certainly a more self-contained solution this
> > > > > > > > way, so I expect we'll only need patch 1/3 from this series.  Thanks,          
> > > > > > > 
> > > > > > > Okay.  (I'm not saying patch 3 is bad, but okay.)
> > > > > > > 
> > > > > > > If you go through the trouble of mapping the BDSM through a pci bar,
> > > > > > > then why not do the same with ASLS then too?        
> > > > > > 
> > > > > > I suppose we could do that.  There are a few nuances to the fake BAR
> > > > > > solution:
> > > > > > 
> > > > > > 1) The BAR needs to get mapped and not remapped while in use - usually
> > > > > >    not a problem.
> > > > > > 
> > > > > > 2) The guest needs to not disable the device we attach the BARs to,
> > > > > >    which it might do if it doesn't recognize the device.
> > > > > > 
> > > > > > 3) We need to be careful about adding BARs to devices the guest does
> > > > > >    have drivers for or we might overlap real functionality.
> > > > > > 
> > > > > > 4) If we create a dummy device with bogus IDs, it will show up with an
> > > > > >    exclamation mark in device manager, which makes people unhappy.
> > > > > > 
> > > > > > So from a perspective of being self contained, the fake BAR solution is
> > > > > > very good, but it's not without issue.  I'll try to think of what sort
> > > > > > of dummy device we could create that would always have a guest driver,
> > > > > > but nothing that a couple extra BARs would interfere with.  Maybe a
> > > > > > generic PCI bridge.  Thanks,        
> > > > > 
> > > > > Okay.  Again, I'm not stating a preferred direction.
> > > > > 
> > > > > BTW, I wonder if the recent discussion between Michael and Igor is
> > > > > relevant here:
> > > > >   https://lists.gnu.org/archive/html/qemu-devel/2016-01/msg05602.html      
> > > > 
> > > > I'm certainly open to rebuttals against this approach, but I do have it
> > > > working.  Being entirely self contained is pretty intriguing.
> > > > Theoretically this would allow us to work with OVMF with no
> > > > modifications.  Linux guests enable and disable devices several times
> > > > during boot (per the spec, any time the BAR is sized it should be
> > > > disabled), Windows never seems to disable the device.  The LPC/ISA
> > > > bridge seems to be the best place to put these BARs, we need to create
> > > > that anyway for pre-Broadwell/Skylake and the device itself has no
> > > > implemented BARs. The ISA bridge is just a shell device to keep the
> > > > driver happy on those older chips, so squatting on a couple BARs
> > > > doesn't seem too terrible. Thanks,    
> > > I'm not sure what way Windows would behave wrt rebalancing
> > > if that fake bar is added to LPC/ISA bridge and I can think of
> > > a way to test and verify it either.
> > > 
> > > In above discussion Michael dislikes that allocating GPA in QEMU,
> > > while I dislike stealing guest's RAM for just getting the same GPA.
> > > Perhaps it could better if we teach SeaBIOS/OVMF to allocate GPA
> > > without stealing guest's RAM somewhere in free address space and
> > > tell QEMU about it.
> > > Michael suggested to do something similar for stolen RAM in
> > > bios_linker interface only, which is a part of SeaBIOS/OVMF
> > > but maybe making it more generic PV interface might be even
> > > more useful. That way we won't have to implement fake PCI
> > > devices/BARs nor hijack existing PCI devices.  
> > 
> > Well, I was hoping that your requirements that the address cannot move
> > or be disabled was more strict that mine, but I think I've just proven
> > it's not.  I prevented the native drivers from loading in the guest
> > (pci-stubs.ids=8086:0126), reserved the memory that SeaBIOS was
> > programming for the ISA bridge BARs (memmap=2M#0xf1000000), told PCI to
> > reassign devices (pci=realloc), and booted with a VESA mode (vga=ask).
> > If I get all those things to align then VESA mode makes use of the
> > stolen memory assigned at vBIOS init and when the guest moves it we get
> > garbage on the screen and a spew of IOMMU faults in the host.  Even if
> > I don't do the reserved/realloc thing, I get a single IOMMU fault in
> > the host, which I assume is an access that occurs while the BAR is
> > being sized and therefore disabled.  That would be a more realistic
> > scenario of an install program not having native support for the
> > GPU.  Darn, this PCI hack was growing on me.
> > 
> > I don't mind the allocation of guest memory issue, the fact that guest
> > memory is consumed by built-in devices is exactly what happens on bare
> > metal.  
> 
> Right, I keep saying that.
> 
> > The only part I don't like about the BIOS assigning the GPA is
> > that we need support in each BIOS to do that.  
> 
> But then if we assign it on QEMU do we still need
> each BIOS to read it?

If QEMU allocated the GPA and passed an e820 reserved range, the BIOS
would have no reason to access or setup the space itself in my case.
Even if we wanted to refresh the contents every machine reset, we could
do that from QEMU.
 
> >  I think we had already
> > come up with a pretty good way to handle the opregion using
> > etc/igd-opregion to fill the space, for stolen memory, I think we're
> > just going to need to do something similar, maybe even via a dummy
> > fw_cfg file so that SeaBIOS not only knows the size, but if we needed
> > to pre-populate it, we could.  It at least puts QEMU in control, but
> > relaying back the address via a device specific register is still a bit
> > ugly.  Thanks,
> > 
> > Alex  
> 
> I can build a generic interface to pass addresses
> allocated by bios back to QEMU. It looks like this would
> be useful for other purposes as well.  Interested?

The only thing I don't like about the fw_cfg files solution is how we
get the resulting GPA back into QEMU.  For the OpRegion, we don't
really care unless we want to add support for remapping direct access
to the host OpRegion into the space the BIOS has allocated.  For stolen
memory, we do need access to that address in QEMU to complete the
quirk that fixes address written to hardware. The BIOS can write the
address back to the BDSM register on the device, but then we need to
worry if that register offset changes to other purposes in future
hardware and exposing something that maybe we'd rather not expose to
other guest drivers.

Long story short, yeah I wish there was a fw_cfg (or similar) path to
get that address, but we may just end up writing it to the BDSM
register anyway.  Thanks,

Alex
Michael S. Tsirkin Feb. 15, 2016, 8:08 p.m. UTC | #17
On Mon, Feb 15, 2016 at 02:50:41PM -0500, Kevin O'Connor wrote:
> On Mon, Feb 15, 2016 at 09:29:26PM +0200, Michael S. Tsirkin wrote:
> > I can build a generic interface to pass addresses
> > allocated by bios back to QEMU. It looks like this would
> > be useful for other purposes as well.  Interested?
> 
> If this is undertaken, I suggest extending fw_cfg to support "writable
> files" instead of introducing a whole new interface.
> 
> -Kevin

Exactly, this is what I had in mind.
Gerd Hoffmann Feb. 16, 2016, 12:08 p.m. UTC | #18
Hi,

> I don't mind the allocation of guest memory issue, the fact that guest
> memory is consumed by built-in devices is exactly what happens on bare
> metal.

And following in qemu what happens on bare metal usually works best
long-term.

> to pre-populate it, we could.  It at least puts QEMU in control, but
> relaying back the address via a device specific register is still a bit
> ugly.

It is the usual way.  We do the same for a few other addresses too
(mmconfig xbar, acpi registers).  Firmware programs the hardware, and
qemu picks up the addresses from there (and puts them into the generated
acpi tables for example).

Main advantage is that we don't need a paravirtual firmware <=> qemu
interface for each address.

Yes, each firmware (seabios/ovmf/...) needs to handle it then.  But if
we allocate the stolen memory from guest ram the firmware has to handle
it _anyway_ so it is marked reserved in e820 etc.

cheers,
  Gerd
diff mbox

Patch

diff --git a/src/fw/pciinit.c b/src/fw/pciinit.c
index 92170d5..c1ad5d4 100644
--- a/src/fw/pciinit.c
+++ b/src/fw/pciinit.c
@@ -260,7 +260,7 @@  static void ich9_smbus_setup(struct pci_device *dev, void *arg)
 static void intel_igd_opregion_setup(struct pci_device *dev, void *arg)
 {
     struct romfile_s *file = romfile_find("etc/igd-opregion");
-    void *opregion;
+    void *opregion, *bdsm;
     u16 bdf = dev->bdf;
 
     if (!file || !file->size)
@@ -281,6 +281,17 @@  static void intel_igd_opregion_setup(struct pci_device *dev, void *arg)
 
     dprintf(1, "Intel IGD OpRegion enabled on %02x:%02x.%x\n",
             pci_bdf_to_bus(bdf), pci_bdf_to_dev(bdf), pci_bdf_to_fn(bdf));
+
+    bdsm = memalign_high(1024 * 1024, 1024 * 1024);
+    if (!bdsm) {
+        warn_noalloc();
+        return;
+    }
+
+    pci_config_writel(bdf, 0x5C, cpu_to_le32((u32)bdsm));
+
+    dprintf(1, "Allocated 1MB reserved memory for Intel IGD stolen memory at "
+            "0x%08x\n", (u32)bdsm);
 }
 
 static const struct pci_device_id pci_device_tbl[] = {