Patchwork [4/5] Make MMIO address page aligned in guest.

login
register
mail settings
Submitter Gleb Natapov
Date Oct. 11, 2009, 6:59 p.m.
Message ID <1255287547-28329-4-git-send-email-gleb@redhat.com>
Download mbox | patch
Permalink /patch/35721/
State Not Applicable
Headers show

Comments

Gleb Natapov - Oct. 11, 2009, 6:59 p.m.
MMIO of some devices are not page aligned, such as some EHCI
controllers and virtual Realtek NIC in guest. Current guest
bios doesn't guarantee the start address of MMIO page aligned.
This may result in failure of device assignment, because KVM
only allow to register page aligned memory slots. For example,
it fails to assign EHCI controller (its MMIO size is 1KB) with
virtual Realtek NIC (its MMIO size is 256Bytes), because MMIO
of virtual Realtek NIC in guest starts from 0xf2001000, MMIO of
the EHCI controller will starts from 0xf2001400.

MMIO addresses in guest are allocated in guest bios. This patch
makes MMIO address page aligned in bios, then fixes the issue.

qemu-kvm commit ccc9b91a1fdfac4161461a519e10a233f5066d2f

Signed-off-by: Gleb Natapov <gleb@redhat.com>
---
 src/pciinit.c |    7 +++++++
 1 files changed, 7 insertions(+), 0 deletions(-)
Michael S. Tsirkin - Oct. 11, 2009, 9:48 p.m.
On Sun, Oct 11, 2009 at 08:59:06PM +0200, Gleb Natapov wrote:
> MMIO of some devices are not page aligned, such as some EHCI
> controllers and virtual Realtek NIC in guest. Current guest
> bios doesn't guarantee the start address of MMIO page aligned.
> This may result in failure of device assignment, because KVM
> only allow to register page aligned memory slots. For example,
> it fails to assign EHCI controller (its MMIO size is 1KB) with
> virtual Realtek NIC (its MMIO size is 256Bytes), because MMIO
> of virtual Realtek NIC in guest starts from 0xf2001000, MMIO of
> the EHCI controller will starts from 0xf2001400.
> 
> MMIO addresses in guest are allocated in guest bios. This patch
> makes MMIO address page aligned in bios, then fixes the issue.
> 
> qemu-kvm commit ccc9b91a1fdfac4161461a519e10a233f5066d2f
> 
> Signed-off-by: Gleb Natapov <gleb@redhat.com>

This wastes memory for non-assigned devices.  I think it's better, and
cleaner, to make qemu increase the BAR size up to 4K for assigned
devices if it wants page size alignment.


> ---
>  src/pciinit.c |    7 +++++++
>  1 files changed, 7 insertions(+), 0 deletions(-)
> 
> diff --git a/src/pciinit.c b/src/pciinit.c
> index 29b3901..53fbfcf 100644
> --- a/src/pciinit.c
> +++ b/src/pciinit.c
> @@ -10,6 +10,7 @@
>  #include "biosvar.h" // GET_EBDA
>  #include "pci_ids.h" // PCI_VENDOR_ID_INTEL
>  #include "pci_regs.h" // PCI_COMMAND
> +#include "paravirt.h"
>  
>  #define PCI_ROM_SLOT 6
>  #define PCI_NUM_REGIONS 7
> @@ -158,6 +159,12 @@ static void pci_bios_init_device(u16 bdf)
>                  *paddr = ALIGN(*paddr, size);
>                  pci_set_io_region_addr(bdf, i, *paddr);
>                  *paddr += size;
> +                if (kvm_para_available()) {
> +                    /* make memory address page aligned */
> +                    /* needed for device assignment on kvm */
> +                    if (!(val & PCI_BASE_ADDRESS_SPACE_IO))
> +                        *paddr = (*paddr + 0xfff) & 0xfffff000;
> +               }
>              }
>          }
>          break;
> -- 
> 1.6.3.3
> 
>
Gleb Natapov - Oct. 12, 2009, 6:44 a.m.
On Sun, Oct 11, 2009 at 11:48:20PM +0200, Michael S. Tsirkin wrote:
> On Sun, Oct 11, 2009 at 08:59:06PM +0200, Gleb Natapov wrote:
> > MMIO of some devices are not page aligned, such as some EHCI
> > controllers and virtual Realtek NIC in guest. Current guest
> > bios doesn't guarantee the start address of MMIO page aligned.
> > This may result in failure of device assignment, because KVM
> > only allow to register page aligned memory slots. For example,
> > it fails to assign EHCI controller (its MMIO size is 1KB) with
> > virtual Realtek NIC (its MMIO size is 256Bytes), because MMIO
> > of virtual Realtek NIC in guest starts from 0xf2001000, MMIO of
> > the EHCI controller will starts from 0xf2001400.
> > 
> > MMIO addresses in guest are allocated in guest bios. This patch
> > makes MMIO address page aligned in bios, then fixes the issue.
> > 
> > qemu-kvm commit ccc9b91a1fdfac4161461a519e10a233f5066d2f
> > 
> > Signed-off-by: Gleb Natapov <gleb@redhat.com>
> 
> This wastes memory for non-assigned devices.  I think it's better, and
> cleaner, to make qemu increase the BAR size up to 4K for assigned
> devices if it wants page size alignment.
> 
We have three and a half devices in QEUM so I don't think memory is a
big concern. Regardless, if you think that fiddle with assigned devices
responses is better idea go ahead and send patches. As it stands this
patch is in kvm's bios and is required for assigned devices to work
for some devices, so moving to seabios without this patch will introduce
a regression.

> 
> > ---
> >  src/pciinit.c |    7 +++++++
> >  1 files changed, 7 insertions(+), 0 deletions(-)
> > 
> > diff --git a/src/pciinit.c b/src/pciinit.c
> > index 29b3901..53fbfcf 100644
> > --- a/src/pciinit.c
> > +++ b/src/pciinit.c
> > @@ -10,6 +10,7 @@
> >  #include "biosvar.h" // GET_EBDA
> >  #include "pci_ids.h" // PCI_VENDOR_ID_INTEL
> >  #include "pci_regs.h" // PCI_COMMAND
> > +#include "paravirt.h"
> >  
> >  #define PCI_ROM_SLOT 6
> >  #define PCI_NUM_REGIONS 7
> > @@ -158,6 +159,12 @@ static void pci_bios_init_device(u16 bdf)
> >                  *paddr = ALIGN(*paddr, size);
> >                  pci_set_io_region_addr(bdf, i, *paddr);
> >                  *paddr += size;
> > +                if (kvm_para_available()) {
> > +                    /* make memory address page aligned */
> > +                    /* needed for device assignment on kvm */
> > +                    if (!(val & PCI_BASE_ADDRESS_SPACE_IO))
> > +                        *paddr = (*paddr + 0xfff) & 0xfffff000;
> > +               }
> >              }
> >          }
> >          break;
> > -- 
> > 1.6.3.3
> > 
> > 

--
			Gleb.
Michael S. Tsirkin - Oct. 12, 2009, 7:10 a.m.
On Mon, Oct 12, 2009 at 08:44:33AM +0200, Gleb Natapov wrote:
> On Sun, Oct 11, 2009 at 11:48:20PM +0200, Michael S. Tsirkin wrote:
> > On Sun, Oct 11, 2009 at 08:59:06PM +0200, Gleb Natapov wrote:
> > > MMIO of some devices are not page aligned, such as some EHCI
> > > controllers and virtual Realtek NIC in guest. Current guest
> > > bios doesn't guarantee the start address of MMIO page aligned.
> > > This may result in failure of device assignment, because KVM
> > > only allow to register page aligned memory slots. For example,
> > > it fails to assign EHCI controller (its MMIO size is 1KB) with
> > > virtual Realtek NIC (its MMIO size is 256Bytes), because MMIO
> > > of virtual Realtek NIC in guest starts from 0xf2001000, MMIO of
> > > the EHCI controller will starts from 0xf2001400.
> > > 
> > > MMIO addresses in guest are allocated in guest bios. This patch
> > > makes MMIO address page aligned in bios, then fixes the issue.
> > > 
> > > qemu-kvm commit ccc9b91a1fdfac4161461a519e10a233f5066d2f
> > > 
> > > Signed-off-by: Gleb Natapov <gleb@redhat.com>
> > 
> > This wastes memory for non-assigned devices.  I think it's better, and
> > cleaner, to make qemu increase the BAR size up to 4K for assigned
> > devices if it wants page size alignment.
> > 
> We have three and a half devices in QEUM so I don't think memory is a
> big concern. Regardless, if you think that fiddle with assigned devices
> responses is better idea go ahead and send patches.

Even if you fiddle with BIOS, guest is allowed to reassign BARs,
breaking your assumptions.

> As it stands this
> patch is in kvm's bios and is required for assigned devices to work
> for some devices, so moving to seabios without this patch will introduce
> a regression.

I have a question here: if kvm maps a full physical page
into guest memory, while device only uses part of the page,
won't that mean that guest is granted access outside the
device, which it should not have?
Maybe the solution is to disable bypass for sub-page BARs and to
handle them in qemu, where we don't have alignment restrictions?

> > 
> > > ---
> > >  src/pciinit.c |    7 +++++++
> > >  1 files changed, 7 insertions(+), 0 deletions(-)
> > > 
> > > diff --git a/src/pciinit.c b/src/pciinit.c
> > > index 29b3901..53fbfcf 100644
> > > --- a/src/pciinit.c
> > > +++ b/src/pciinit.c
> > > @@ -10,6 +10,7 @@
> > >  #include "biosvar.h" // GET_EBDA
> > >  #include "pci_ids.h" // PCI_VENDOR_ID_INTEL
> > >  #include "pci_regs.h" // PCI_COMMAND
> > > +#include "paravirt.h"
> > >  
> > >  #define PCI_ROM_SLOT 6
> > >  #define PCI_NUM_REGIONS 7
> > > @@ -158,6 +159,12 @@ static void pci_bios_init_device(u16 bdf)
> > >                  *paddr = ALIGN(*paddr, size);
> > >                  pci_set_io_region_addr(bdf, i, *paddr);
> > >                  *paddr += size;
> > > +                if (kvm_para_available()) {
> > > +                    /* make memory address page aligned */
> > > +                    /* needed for device assignment on kvm */
> > > +                    if (!(val & PCI_BASE_ADDRESS_SPACE_IO))
> > > +                        *paddr = (*paddr + 0xfff) & 0xfffff000;
> > > +               }
> > >              }
> > >          }
> > >          break;
> > > -- 
> > > 1.6.3.3
> > > 
> > > 
> 
> --
> 			Gleb.
Gleb Natapov - Oct. 12, 2009, 7:22 a.m.
On Mon, Oct 12, 2009 at 09:10:52AM +0200, Michael S. Tsirkin wrote:
> On Mon, Oct 12, 2009 at 08:44:33AM +0200, Gleb Natapov wrote:
> > On Sun, Oct 11, 2009 at 11:48:20PM +0200, Michael S. Tsirkin wrote:
> > > On Sun, Oct 11, 2009 at 08:59:06PM +0200, Gleb Natapov wrote:
> > > > MMIO of some devices are not page aligned, such as some EHCI
> > > > controllers and virtual Realtek NIC in guest. Current guest
> > > > bios doesn't guarantee the start address of MMIO page aligned.
> > > > This may result in failure of device assignment, because KVM
> > > > only allow to register page aligned memory slots. For example,
> > > > it fails to assign EHCI controller (its MMIO size is 1KB) with
> > > > virtual Realtek NIC (its MMIO size is 256Bytes), because MMIO
> > > > of virtual Realtek NIC in guest starts from 0xf2001000, MMIO of
> > > > the EHCI controller will starts from 0xf2001400.
> > > > 
> > > > MMIO addresses in guest are allocated in guest bios. This patch
> > > > makes MMIO address page aligned in bios, then fixes the issue.
> > > > 
> > > > qemu-kvm commit ccc9b91a1fdfac4161461a519e10a233f5066d2f
> > > > 
> > > > Signed-off-by: Gleb Natapov <gleb@redhat.com>
> > > 
> > > This wastes memory for non-assigned devices.  I think it's better, and
> > > cleaner, to make qemu increase the BAR size up to 4K for assigned
> > > devices if it wants page size alignment.
> > > 
> > We have three and a half devices in QEUM so I don't think memory is a
> > big concern. Regardless, if you think that fiddle with assigned devices
> > responses is better idea go ahead and send patches.
> 
> Even if you fiddle with BIOS, guest is allowed to reassign BARs,
> breaking your assumptions.
Good point. So the fact that this patched helped its creator shows that
linux doesn't do this.

> > As it stands this
> > patch is in kvm's bios and is required for assigned devices to work
> > for some devices, so moving to seabios without this patch will introduce
> > a regression.
> 
> I have a question here: if kvm maps a full physical page
> into guest memory, while device only uses part of the page,
> won't that mean that guest is granted access outside the
> device, which it should not have?
And how is real HW different? It maps a full physical page into OS
memory even if BAR is smaller then page and grants OS access to
unassigned mmio region. Access unassigned mmio region shouldn't cause
any trouble, doesn't it?

> Maybe the solution is to disable bypass for sub-page BARs and to
> handle them in qemu, where we don't have alignment restrictions?
> 
Making fast path go through qemu for assigned devices? May be remove 
this pass through crap from kvm to save us all from this misery then? 

> > > 
> > > > ---
> > > >  src/pciinit.c |    7 +++++++
> > > >  1 files changed, 7 insertions(+), 0 deletions(-)
> > > > 
> > > > diff --git a/src/pciinit.c b/src/pciinit.c
> > > > index 29b3901..53fbfcf 100644
> > > > --- a/src/pciinit.c
> > > > +++ b/src/pciinit.c
> > > > @@ -10,6 +10,7 @@
> > > >  #include "biosvar.h" // GET_EBDA
> > > >  #include "pci_ids.h" // PCI_VENDOR_ID_INTEL
> > > >  #include "pci_regs.h" // PCI_COMMAND
> > > > +#include "paravirt.h"
> > > >  
> > > >  #define PCI_ROM_SLOT 6
> > > >  #define PCI_NUM_REGIONS 7
> > > > @@ -158,6 +159,12 @@ static void pci_bios_init_device(u16 bdf)
> > > >                  *paddr = ALIGN(*paddr, size);
> > > >                  pci_set_io_region_addr(bdf, i, *paddr);
> > > >                  *paddr += size;
> > > > +                if (kvm_para_available()) {
> > > > +                    /* make memory address page aligned */
> > > > +                    /* needed for device assignment on kvm */
> > > > +                    if (!(val & PCI_BASE_ADDRESS_SPACE_IO))
> > > > +                        *paddr = (*paddr + 0xfff) & 0xfffff000;
> > > > +               }
> > > >              }
> > > >          }
> > > >          break;
> > > > -- 
> > > > 1.6.3.3
> > > > 
> > > > 
> > 
> > --
> > 			Gleb.

--
			Gleb.
Michael S. Tsirkin - Oct. 12, 2009, 8:13 a.m.
On Mon, Oct 12, 2009 at 09:22:02AM +0200, Gleb Natapov wrote:
> On Mon, Oct 12, 2009 at 09:10:52AM +0200, Michael S. Tsirkin wrote:
> > On Mon, Oct 12, 2009 at 08:44:33AM +0200, Gleb Natapov wrote:
> > > On Sun, Oct 11, 2009 at 11:48:20PM +0200, Michael S. Tsirkin wrote:
> > > > On Sun, Oct 11, 2009 at 08:59:06PM +0200, Gleb Natapov wrote:
> > > > > MMIO of some devices are not page aligned, such as some EHCI
> > > > > controllers and virtual Realtek NIC in guest. Current guest
> > > > > bios doesn't guarantee the start address of MMIO page aligned.
> > > > > This may result in failure of device assignment, because KVM
> > > > > only allow to register page aligned memory slots. For example,
> > > > > it fails to assign EHCI controller (its MMIO size is 1KB) with
> > > > > virtual Realtek NIC (its MMIO size is 256Bytes), because MMIO
> > > > > of virtual Realtek NIC in guest starts from 0xf2001000, MMIO of
> > > > > the EHCI controller will starts from 0xf2001400.
> > > > > 
> > > > > MMIO addresses in guest are allocated in guest bios. This patch
> > > > > makes MMIO address page aligned in bios, then fixes the issue.
> > > > > 
> > > > > qemu-kvm commit ccc9b91a1fdfac4161461a519e10a233f5066d2f
> > > > > 
> > > > > Signed-off-by: Gleb Natapov <gleb@redhat.com>
> > > > 
> > > > This wastes memory for non-assigned devices.  I think it's better, and
> > > > cleaner, to make qemu increase the BAR size up to 4K for assigned
> > > > devices if it wants page size alignment.
> > > > 
> > > We have three and a half devices in QEUM so I don't think memory is a
> > > big concern. Regardless, if you think that fiddle with assigned devices
> > > responses is better idea go ahead and send patches.
> > 
> > Even if you fiddle with BIOS, guest is allowed to reassign BARs,
> > breaking your assumptions.
> Good point. So the fact that this patched helped its creator shows that
> linux doesn't do this.

Try hot-plugging the device instead of have it present on boot.
Patching BIOS won't help then, will it?  So my question is, if we need
to handle this in qemu, is it worth it to do it in kvm as well?

> > > As it stands this
> > > patch is in kvm's bios and is required for assigned devices to work
> > > for some devices, so moving to seabios without this patch will introduce
> > > a regression.
> > 
> > I have a question here: if kvm maps a full physical page
> > into guest memory, while device only uses part of the page,
> > won't that mean that guest is granted access outside the
> > device, which it should not have?
> And how is real HW different? It maps a full physical page into OS
> memory even if BAR is smaller then page and grants OS access to
> unassigned mmio region. Access unassigned mmio region shouldn't cause
> any trouble, doesn't it?

Unassigned - typically no, but there can be another device there, or a RAM
page.  It is different on real hardware where OS has access to all RAM and all
devices, anyway.

Here's an example from my laptop:

00:03.0 Communication controller: Intel Corporation Mobile 4 Series Chipset MEI Controller (rev 07)
        Subsystem: Lenovo Device 20e6
        Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx+
        Latency: 0
        Interrupt: pin A routed to IRQ 11
        Region 0: Memory at fc226800 (64-bit, non-prefetchable) [size=16]
        Capabilities: <access denied>

...

00:1f.2 SATA controller: Intel Corporation ICH9M/M-E SATA AHCI Controller (rev 03) (prog-if 01 [AHCI 1.0])
        Subsystem: Lenovo Device 20f8
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
        Status: Cap+ 66MHz+ UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0
        Interrupt: pin B routed to IRQ 28
        Region 0: I/O ports at 1c48 [size=8]
        Region 1: I/O ports at 183c [size=4]
        Region 2: I/O ports at 1c40 [size=8]
        Region 3: I/O ports at 1838 [size=4]
        Region 4: I/O ports at 1c20 [size=32]
        Region 5: Memory at fc226000 (32-bit, non-prefetchable) [size=2K]
        Capabilities: <access denied>
        Kernel driver in use: ahci

In this setup, if you assign a page at address fc226000, for SATA,
I think that guest will be able to control Communication controller as well.

> > Maybe the solution is to disable bypass for sub-page BARs and to
> > handle them in qemu, where we don't have alignment restrictions?
> > 
> Making fast path go through qemu for assigned devices? May be remove 
> this pass through crap from kvm to save us all from this misery then? 

Another option is for KVM to check these scenarious and deny assignment if
there's such an overlap.

> > > > 
> > > > > ---
> > > > >  src/pciinit.c |    7 +++++++
> > > > >  1 files changed, 7 insertions(+), 0 deletions(-)
> > > > > 
> > > > > diff --git a/src/pciinit.c b/src/pciinit.c
> > > > > index 29b3901..53fbfcf 100644
> > > > > --- a/src/pciinit.c
> > > > > +++ b/src/pciinit.c
> > > > > @@ -10,6 +10,7 @@
> > > > >  #include "biosvar.h" // GET_EBDA
> > > > >  #include "pci_ids.h" // PCI_VENDOR_ID_INTEL
> > > > >  #include "pci_regs.h" // PCI_COMMAND
> > > > > +#include "paravirt.h"
> > > > >  
> > > > >  #define PCI_ROM_SLOT 6
> > > > >  #define PCI_NUM_REGIONS 7
> > > > > @@ -158,6 +159,12 @@ static void pci_bios_init_device(u16 bdf)
> > > > >                  *paddr = ALIGN(*paddr, size);
> > > > >                  pci_set_io_region_addr(bdf, i, *paddr);
> > > > >                  *paddr += size;
> > > > > +                if (kvm_para_available()) {
> > > > > +                    /* make memory address page aligned */
> > > > > +                    /* needed for device assignment on kvm */
> > > > > +                    if (!(val & PCI_BASE_ADDRESS_SPACE_IO))
> > > > > +                        *paddr = (*paddr + 0xfff) & 0xfffff000;
> > > > > +               }
> > > > >              }
> > > > >          }
> > > > >          break;
> > > > > -- 
> > > > > 1.6.3.3
> > > > > 
> > > > > 
> > > 
> > > --
> > > 			Gleb.
> 
> --
> 			Gleb.
Gleb Natapov - Oct. 12, 2009, 8:48 a.m.
On Mon, Oct 12, 2009 at 10:13:14AM +0200, Michael S. Tsirkin wrote:
> > > > > 
> > > > > This wastes memory for non-assigned devices.  I think it's better, and
> > > > > cleaner, to make qemu increase the BAR size up to 4K for assigned
> > > > > devices if it wants page size alignment.
> > > > > 
> > > > We have three and a half devices in QEUM so I don't think memory is a
> > > > big concern. Regardless, if you think that fiddle with assigned devices
> > > > responses is better idea go ahead and send patches.
> > > 
> > > Even if you fiddle with BIOS, guest is allowed to reassign BARs,
> > > breaking your assumptions.
> > Good point. So the fact that this patched helped its creator shows that
> > linux doesn't do this.
> 
> Try hot-plugging the device instead of have it present on boot.
> Patching BIOS won't help then, will it?  So my question is, if we need
> to handle this in qemu, is it worth it to do it in kvm as well?
> 
It depend how linux assign mmio address to hot pluggable devices. How
can you be sure a device driver continue working if you'll misrepresent
BAR size BTW?

> > > > As it stands this
> > > > patch is in kvm's bios and is required for assigned devices to work
> > > > for some devices, so moving to seabios without this patch will introduce
> > > > a regression.
> > > 
> > > I have a question here: if kvm maps a full physical page
> > > into guest memory, while device only uses part of the page,
> > > won't that mean that guest is granted access outside the
> > > device, which it should not have?
> > And how is real HW different? It maps a full physical page into OS
> > memory even if BAR is smaller then page and grants OS access to
> > unassigned mmio region. Access unassigned mmio region shouldn't cause
> > any trouble, doesn't it?
> 
> Unassigned - typically no, but there can be another device there, or a RAM
> page.  It is different on real hardware where OS has access to all RAM and all
> devices, anyway.
> 
> Here's an example from my laptop:
> 
> 00:03.0 Communication controller: Intel Corporation Mobile 4 Series Chipset MEI Controller (rev 07)
>         Subsystem: Lenovo Device 20e6
>         Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
>         Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx+
>         Latency: 0
>         Interrupt: pin A routed to IRQ 11
>         Region 0: Memory at fc226800 (64-bit, non-prefetchable) [size=16]
>         Capabilities: <access denied>
> 
> ...
> 
> 00:1f.2 SATA controller: Intel Corporation ICH9M/M-E SATA AHCI Controller (rev 03) (prog-if 01 [AHCI 1.0])
>         Subsystem: Lenovo Device 20f8
>         Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
>         Status: Cap+ 66MHz+ UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
>         Latency: 0
>         Interrupt: pin B routed to IRQ 28
>         Region 0: I/O ports at 1c48 [size=8]
>         Region 1: I/O ports at 183c [size=4]
>         Region 2: I/O ports at 1c40 [size=8]
>         Region 3: I/O ports at 1838 [size=4]
>         Region 4: I/O ports at 1c20 [size=32]
>         Region 5: Memory at fc226000 (32-bit, non-prefetchable) [size=2K]
>         Capabilities: <access denied>
>         Kernel driver in use: ahci
> 
> In this setup, if you assign a page at address fc226000, for SATA,
> I think that guest will be able to control Communication controller as well.
Who configures BARs for assigned device guest or host? If host you can't
safely passthrough one of those devices. But passthrough is not secure
anyway since guest can DMA all over host memory.

> 
> > > Maybe the solution is to disable bypass for sub-page BARs and to
> > > handle them in qemu, where we don't have alignment restrictions?
> > > 
> > Making fast path go through qemu for assigned devices? May be remove 
> > this pass through crap from kvm to save us all from this misery then? 
> 
> Another option is for KVM to check these scenarious and deny assignment if
> there's such an overlap.
One more constrain for device assignment. Simple real life scenarios
don't work for our users as it is. Adding more constrains will not help.

> 
> > > > > 
> > > > > > ---
> > > > > >  src/pciinit.c |    7 +++++++
> > > > > >  1 files changed, 7 insertions(+), 0 deletions(-)
> > > > > > 
> > > > > > diff --git a/src/pciinit.c b/src/pciinit.c
> > > > > > index 29b3901..53fbfcf 100644
> > > > > > --- a/src/pciinit.c
> > > > > > +++ b/src/pciinit.c
> > > > > > @@ -10,6 +10,7 @@
> > > > > >  #include "biosvar.h" // GET_EBDA
> > > > > >  #include "pci_ids.h" // PCI_VENDOR_ID_INTEL
> > > > > >  #include "pci_regs.h" // PCI_COMMAND
> > > > > > +#include "paravirt.h"
> > > > > >  
> > > > > >  #define PCI_ROM_SLOT 6
> > > > > >  #define PCI_NUM_REGIONS 7
> > > > > > @@ -158,6 +159,12 @@ static void pci_bios_init_device(u16 bdf)
> > > > > >                  *paddr = ALIGN(*paddr, size);
> > > > > >                  pci_set_io_region_addr(bdf, i, *paddr);
> > > > > >                  *paddr += size;
> > > > > > +                if (kvm_para_available()) {
> > > > > > +                    /* make memory address page aligned */
> > > > > > +                    /* needed for device assignment on kvm */
> > > > > > +                    if (!(val & PCI_BASE_ADDRESS_SPACE_IO))
> > > > > > +                        *paddr = (*paddr + 0xfff) & 0xfffff000;
> > > > > > +               }
> > > > > >              }
> > > > > >          }
> > > > > >          break;
> > > > > > -- 
> > > > > > 1.6.3.3
> > > > > > 
> > > > > > 
> > > > 
> > > > --
> > > > 			Gleb.
> > 
> > --
> > 			Gleb.

--
			Gleb.
Michael S. Tsirkin - Oct. 12, 2009, 9:43 a.m.
On Mon, Oct 12, 2009 at 10:48:58AM +0200, Gleb Natapov wrote:
> On Mon, Oct 12, 2009 at 10:13:14AM +0200, Michael S. Tsirkin wrote:
> > > > > > 
> > > > > > This wastes memory for non-assigned devices.  I think it's better, and
> > > > > > cleaner, to make qemu increase the BAR size up to 4K for assigned
> > > > > > devices if it wants page size alignment.
> > > > > > 
> > > > > We have three and a half devices in QEUM so I don't think memory is a
> > > > > big concern. Regardless, if you think that fiddle with assigned devices
> > > > > responses is better idea go ahead and send patches.
> > > > 
> > > > Even if you fiddle with BIOS, guest is allowed to reassign BARs,
> > > > breaking your assumptions.
> > > Good point. So the fact that this patched helped its creator shows that
> > > linux doesn't do this.
> > 
> > Try hot-plugging the device instead of have it present on boot.
> > Patching BIOS won't help then, will it?  So my question is, if we need
> > to handle this in qemu, is it worth it to do it in kvm as well?
> > 
> It depend how linux assign mmio address to hot pluggable devices. How
> can you be sure a device driver continue working if you'll misrepresent
> BAR size BTW?

Yes, this adds yet another way for device to discover it's running in a
VM, so this might break some drivers.  If we see many of these in
practice, we can try adding a PCI-to-PCI bridge with some dummy devices
behind it to the picture, to increase the chances to get a dedicated
memory page.

> > > > > As it stands this
> > > > > patch is in kvm's bios and is required for assigned devices to work
> > > > > for some devices, so moving to seabios without this patch will introduce
> > > > > a regression.
> > > > 
> > > > I have a question here: if kvm maps a full physical page
> > > > into guest memory, while device only uses part of the page,
> > > > won't that mean that guest is granted access outside the
> > > > device, which it should not have?
> > > And how is real HW different? It maps a full physical page into OS
> > > memory even if BAR is smaller then page and grants OS access to
> > > unassigned mmio region. Access unassigned mmio region shouldn't cause
> > > any trouble, doesn't it?
> > 
> > Unassigned - typically no, but there can be another device there, or a RAM
> > page.  It is different on real hardware where OS has access to all RAM and all
> > devices, anyway.
> > 
> > Here's an example from my laptop:
> > 
> > 00:03.0 Communication controller: Intel Corporation Mobile 4 Series Chipset MEI Controller (rev 07)
> >         Subsystem: Lenovo Device 20e6
> >         Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
> >         Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx+
> >         Latency: 0
> >         Interrupt: pin A routed to IRQ 11
> >         Region 0: Memory at fc226800 (64-bit, non-prefetchable) [size=16]
> >         Capabilities: <access denied>
> > 
> > ...
> > 
> > 00:1f.2 SATA controller: Intel Corporation ICH9M/M-E SATA AHCI Controller (rev 03) (prog-if 01 [AHCI 1.0])
> >         Subsystem: Lenovo Device 20f8
> >         Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
> >         Status: Cap+ 66MHz+ UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
> >         Latency: 0
> >         Interrupt: pin B routed to IRQ 28
> >         Region 0: I/O ports at 1c48 [size=8]
> >         Region 1: I/O ports at 183c [size=4]
> >         Region 2: I/O ports at 1c40 [size=8]
> >         Region 3: I/O ports at 1838 [size=4]
> >         Region 4: I/O ports at 1c20 [size=32]
> >         Region 5: Memory at fc226000 (32-bit, non-prefetchable) [size=2K]
> >         Capabilities: <access denied>
> >         Kernel driver in use: ahci
> > 
> > In this setup, if you assign a page at address fc226000, for SATA,
> > I think that guest will be able to control Communication controller as well.
> Who configures BARs for assigned device guest or host?

Host.

> If host you can't safely passthrough one of those devices.

Why not?

> But passthrough is not secure anyway since guest can DMA all over host
> memory.


That's why we only enable it with I/O mmu, right?

> > 
> > > > Maybe the solution is to disable bypass for sub-page BARs and to
> > > > handle them in qemu, where we don't have alignment restrictions?
> > > > 
> > > Making fast path go through qemu for assigned devices? May be remove 
> > > this pass through crap from kvm to save us all from this misery then? 
> > 
> > Another option is for KVM to check these scenarious and deny assignment if
> > there's such an overlap.
> One more constrain for device assignment. Simple real life scenarios
> don't work for our users as it is. Adding more constrains will not help.

For linux host, you can force resource alignment using a kernel
parameter. What do you suggest? Ignore this issue?

> > 
> > > > > > 
> > > > > > > ---
> > > > > > >  src/pciinit.c |    7 +++++++
> > > > > > >  1 files changed, 7 insertions(+), 0 deletions(-)
> > > > > > > 
> > > > > > > diff --git a/src/pciinit.c b/src/pciinit.c
> > > > > > > index 29b3901..53fbfcf 100644
> > > > > > > --- a/src/pciinit.c
> > > > > > > +++ b/src/pciinit.c
> > > > > > > @@ -10,6 +10,7 @@
> > > > > > >  #include "biosvar.h" // GET_EBDA
> > > > > > >  #include "pci_ids.h" // PCI_VENDOR_ID_INTEL
> > > > > > >  #include "pci_regs.h" // PCI_COMMAND
> > > > > > > +#include "paravirt.h"
> > > > > > >  
> > > > > > >  #define PCI_ROM_SLOT 6
> > > > > > >  #define PCI_NUM_REGIONS 7
> > > > > > > @@ -158,6 +159,12 @@ static void pci_bios_init_device(u16 bdf)
> > > > > > >                  *paddr = ALIGN(*paddr, size);
> > > > > > >                  pci_set_io_region_addr(bdf, i, *paddr);
> > > > > > >                  *paddr += size;
> > > > > > > +                if (kvm_para_available()) {
> > > > > > > +                    /* make memory address page aligned */
> > > > > > > +                    /* needed for device assignment on kvm */
> > > > > > > +                    if (!(val & PCI_BASE_ADDRESS_SPACE_IO))
> > > > > > > +                        *paddr = (*paddr + 0xfff) & 0xfffff000;
> > > > > > > +               }
> > > > > > >              }
> > > > > > >          }
> > > > > > >          break;
> > > > > > > -- 
> > > > > > > 1.6.3.3
> > > > > > > 
> > > > > > > 
> > > > > 
> > > > > --
> > > > > 			Gleb.
> > > 
> > > --
> > > 			Gleb.
> 
> --
> 			Gleb.
Gleb Natapov - Oct. 12, 2009, 10:06 a.m.
On Mon, Oct 12, 2009 at 11:43:35AM +0200, Michael S. Tsirkin wrote:
> > > Try hot-plugging the device instead of have it present on boot.
> > > Patching BIOS won't help then, will it?  So my question is, if we need
> > > to handle this in qemu, is it worth it to do it in kvm as well?
> > > 
> > It depend how linux assign mmio address to hot pluggable devices. How
> > can you be sure a device driver continue working if you'll misrepresent
> > BAR size BTW?
> 
> Yes, this adds yet another way for device to discover it's running in a
> VM, so this might break some drivers.  If we see many of these in
> practice, we can try adding a PCI-to-PCI bridge with some dummy devices
> behind it to the picture, to increase the chances to get a dedicated
> memory page.
> 
Go ahead. But why all this churn instead of asking linux to align pci
resources and do the same in a bios.

> > > > > > As it stands this
> > > > > > patch is in kvm's bios and is required for assigned devices to work
> > > > > > for some devices, so moving to seabios without this patch will introduce
> > > > > > a regression.
> > > > > 
> > > > > I have a question here: if kvm maps a full physical page
> > > > > into guest memory, while device only uses part of the page,
> > > > > won't that mean that guest is granted access outside the
> > > > > device, which it should not have?
> > > > And how is real HW different? It maps a full physical page into OS
> > > > memory even if BAR is smaller then page and grants OS access to
> > > > unassigned mmio region. Access unassigned mmio region shouldn't cause
> > > > any trouble, doesn't it?
> > > 
> > > Unassigned - typically no, but there can be another device there, or a RAM
> > > page.  It is different on real hardware where OS has access to all RAM and all
> > > devices, anyway.
> > > 
> > > Here's an example from my laptop:
> > > 
> > > 00:03.0 Communication controller: Intel Corporation Mobile 4 Series Chipset MEI Controller (rev 07)
> > >         Subsystem: Lenovo Device 20e6
> > >         Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
> > >         Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx+
> > >         Latency: 0
> > >         Interrupt: pin A routed to IRQ 11
> > >         Region 0: Memory at fc226800 (64-bit, non-prefetchable) [size=16]
> > >         Capabilities: <access denied>
> > > 
> > > ...
> > > 
> > > 00:1f.2 SATA controller: Intel Corporation ICH9M/M-E SATA AHCI Controller (rev 03) (prog-if 01 [AHCI 1.0])
> > >         Subsystem: Lenovo Device 20f8
> > >         Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
> > >         Status: Cap+ 66MHz+ UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
> > >         Latency: 0
> > >         Interrupt: pin B routed to IRQ 28
> > >         Region 0: I/O ports at 1c48 [size=8]
> > >         Region 1: I/O ports at 183c [size=4]
> > >         Region 2: I/O ports at 1c40 [size=8]
> > >         Region 3: I/O ports at 1838 [size=4]
> > >         Region 4: I/O ports at 1c20 [size=32]
> > >         Region 5: Memory at fc226000 (32-bit, non-prefetchable) [size=2K]
> > >         Capabilities: <access denied>
> > >         Kernel driver in use: ahci
> > > 
> > > In this setup, if you assign a page at address fc226000, for SATA,
> > > I think that guest will be able to control Communication controller as well.
> > Who configures BARs for assigned device guest or host?
> 
> Host.
Nice, device assignment is broken in one more way.

> 
> > If host you can't safely passthrough one of those devices.
> 
> Why not?
> 
For the reason you stated. If you map fc226000-fc227000 to a guest it
can control device at fc226800.

> > But passthrough is not secure anyway since guest can DMA all over host
> > memory.
> 
> 
> That's why we only enable it with I/O mmu, right?
> 
We do? Not sure about it. It is absolutely required if you want
security of course, but not everyone care.

> > > 
> > > > > Maybe the solution is to disable bypass for sub-page BARs and to
> > > > > handle them in qemu, where we don't have alignment restrictions?
> > > > > 
> > > > Making fast path go through qemu for assigned devices? May be remove 
> > > > this pass through crap from kvm to save us all from this misery then? 
> > > 
> > > Another option is for KVM to check these scenarious and deny assignment if
> > > there's such an overlap.
> > One more constrain for device assignment. Simple real life scenarios
> > don't work for our users as it is. Adding more constrains will not help.
> 
> For linux host, you can force resource alignment using a kernel
> parameter. What do you suggest? Ignore this issue?
> 
I suggest forcing resource alignment in a host using a kernel parameter if you
care about security.

--
			Gleb.
Kevin O'Connor - Oct. 12, 2009, 2:27 p.m.
On Sun, Oct 11, 2009 at 08:59:06PM +0200, Gleb Natapov wrote:
> MMIO of some devices are not page aligned, such as some EHCI
[...]
> @@ -158,6 +159,12 @@ static void pci_bios_init_device(u16 bdf)
>                  *paddr = ALIGN(*paddr, size);
>                  pci_set_io_region_addr(bdf, i, *paddr);
>                  *paddr += size;
> +                if (kvm_para_available()) {
> +                    /* make memory address page aligned */
> +                    /* needed for device assignment on kvm */
> +                    if (!(val & PCI_BASE_ADDRESS_SPACE_IO))
> +                        *paddr = (*paddr + 0xfff) & 0xfffff000;
> +               }
>              }

I don't see an issue with doing this.  However, I can't see why it
would be just done for kvm - why not do it for all hosts?

Also, please use the ALIGN() macro - something like:
      *paddr = ALIGN(*paddr, PAGE_SIZE);

-Kevin

Patch

diff --git a/src/pciinit.c b/src/pciinit.c
index 29b3901..53fbfcf 100644
--- a/src/pciinit.c
+++ b/src/pciinit.c
@@ -10,6 +10,7 @@ 
 #include "biosvar.h" // GET_EBDA
 #include "pci_ids.h" // PCI_VENDOR_ID_INTEL
 #include "pci_regs.h" // PCI_COMMAND
+#include "paravirt.h"
 
 #define PCI_ROM_SLOT 6
 #define PCI_NUM_REGIONS 7
@@ -158,6 +159,12 @@  static void pci_bios_init_device(u16 bdf)
                 *paddr = ALIGN(*paddr, size);
                 pci_set_io_region_addr(bdf, i, *paddr);
                 *paddr += size;
+                if (kvm_para_available()) {
+                    /* make memory address page aligned */
+                    /* needed for device assignment on kvm */
+                    if (!(val & PCI_BASE_ADDRESS_SPACE_IO))
+                        *paddr = (*paddr + 0xfff) & 0xfffff000;
+               }
             }
         }
         break;