
[PULL,14/28] exec: make address spaces 64-bit wide

Message ID 1386786509-29966-14-git-send-email-mst@redhat.com
State New

Commit Message

Michael S. Tsirkin Dec. 11, 2013, 6:30 p.m. UTC
From: Paolo Bonzini <pbonzini@redhat.com>

As an alternative to commit 818f86b (exec: limit system memory
size, 2013-11-04) let's just make all address spaces 64-bit wide.
This eliminates problems with phys_page_find ignoring bits above
TARGET_PHYS_ADDR_SPACE_BITS and address_space_translate_internal
consequently messing up the computations.

In Luiz's reported crash, at startup gdb attempts to read from address
0xffffffffffffffe6 to 0xffffffffffffffff inclusive.  The region it gets
is the newly introduced master abort region, which is as big as the PCI
address space (see pci_bus_init).  Due to a typo that's only 2^63-1,
not 2^64.  But we get it anyway because phys_page_find ignores the upper
bits of the physical address.  In address_space_translate_internal then

    diff = int128_sub(section->mr->size, int128_make64(addr));
    *plen = int128_get64(int128_min(diff, int128_make64(*plen)));

diff becomes negative, and int128_get64 booms.

The size of the PCI address space region should be fixed anyway.

Reported-by: Luiz Capitulino <lcapitulino@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
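
To see concretely why int128_get64 "booms", here is a minimal,
self-contained sketch.  The Int128 type and helpers below are
simplified stand-ins for QEMU's include/qemu/int128.h, not the real
implementation; the constants are the ones from the crash described
above (a region sized 2^63-1 and gdb's read at 0xffffffffffffffe6).

    /* Simplified stand-ins for QEMU's Int128 helpers, enough to show
     * why a negative diff trips the assertion in int128_get64(). */
    #include <assert.h>
    #include <stdint.h>
    #include <stdio.h>

    typedef struct { uint64_t lo; int64_t hi; } Int128;

    static Int128 int128_make64(uint64_t v)
    {
        return (Int128){ .lo = v, .hi = 0 };
    }

    static Int128 int128_sub(Int128 a, Int128 b)
    {
        /* subtract with borrow from the low word into the high word */
        return (Int128){ .lo = a.lo - b.lo,
                         .hi = a.hi - b.hi - (a.lo < b.lo) };
    }

    static uint64_t int128_get64(Int128 a)
    {
        assert(!a.hi);              /* fires when the value does not fit in 64 bits */
        return a.lo;
    }

    int main(void)
    {
        /* master abort region: 2^63-1 because of the typo */
        Int128 size = int128_make64(0x7fffffffffffffffULL);
        /* gdb's read address, which phys_page_find lets through */
        Int128 addr = int128_make64(0xffffffffffffffe6ULL);

        Int128 diff = int128_sub(size, addr);   /* negative: diff.hi == -1 */
        printf("diff.hi = %lld\n", (long long)diff.hi);

        return (int)int128_get64(diff);         /* assertion failure */
    }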

Comments

Alex Williamson Jan. 9, 2014, 5:24 p.m. UTC | #1
On Wed, 2013-12-11 at 20:30 +0200, Michael S. Tsirkin wrote:
> From: Paolo Bonzini <pbonzini@redhat.com>
> 
> As an alternative to commit 818f86b (exec: limit system memory
> size, 2013-11-04) let's just make all address spaces 64-bit wide.
> This eliminates problems with phys_page_find ignoring bits above
> TARGET_PHYS_ADDR_SPACE_BITS and address_space_translate_internal
> consequently messing up the computations.
> 
> In Luiz's reported crash, at startup gdb attempts to read from address
> 0xffffffffffffffe6 to 0xffffffffffffffff inclusive.  The region it gets
> is the newly introduced master abort region, which is as big as the PCI
> address space (see pci_bus_init).  Due to a typo that's only 2^63-1,
> not 2^64.  But we get it anyway because phys_page_find ignores the upper
> bits of the physical address.  In address_space_translate_internal then
> 
>     diff = int128_sub(section->mr->size, int128_make64(addr));
>     *plen = int128_get64(int128_min(diff, int128_make64(*plen)));
> 
> diff becomes negative, and int128_get64 booms.
> 
> The size of the PCI address space region should be fixed anyway.
> 
> Reported-by: Luiz Capitulino <lcapitulino@redhat.com>
> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> ---
>  exec.c | 8 ++------
>  1 file changed, 2 insertions(+), 6 deletions(-)
> 
> diff --git a/exec.c b/exec.c
> index 7e5ce93..f907f5f 100644
> --- a/exec.c
> +++ b/exec.c
> @@ -94,7 +94,7 @@ struct PhysPageEntry {
>  #define PHYS_MAP_NODE_NIL (((uint32_t)~0) >> 6)
>  
>  /* Size of the L2 (and L3, etc) page tables.  */
> -#define ADDR_SPACE_BITS TARGET_PHYS_ADDR_SPACE_BITS
> +#define ADDR_SPACE_BITS 64
>  
>  #define P_L2_BITS 10
>  #define P_L2_SIZE (1 << P_L2_BITS)
> @@ -1861,11 +1861,7 @@ static void memory_map_init(void)
>  {
>      system_memory = g_malloc(sizeof(*system_memory));
>  
> -    assert(ADDR_SPACE_BITS <= 64);
> -
> -    memory_region_init(system_memory, NULL, "system",
> -                       ADDR_SPACE_BITS == 64 ?
> -                       UINT64_MAX : (0x1ULL << ADDR_SPACE_BITS));
> +    memory_region_init(system_memory, NULL, "system", UINT64_MAX);
>      address_space_init(&address_space_memory, system_memory, "memory");
>  
>      system_io = g_malloc(sizeof(*system_io));

This seems to have some unexpected consequences around sizing 64bit PCI
BARs that I'm not sure how to handle.  After this patch I get vfio
traces like this:

vfio: vfio_pci_read_config(0000:01:10.0, @0x10, len=0x4) febe0004
(save lower 32bits of BAR)
vfio: vfio_pci_write_config(0000:01:10.0, @0x10, 0xffffffff, len=0x4)
(write mask to BAR)
vfio: region_del febe0000 - febe3fff
(memory region gets unmapped)
vfio: vfio_pci_read_config(0000:01:10.0, @0x10, len=0x4) ffffc004
(read size mask)
vfio: vfio_pci_write_config(0000:01:10.0, @0x10, 0xfebe0004, len=0x4)
(restore BAR)
vfio: region_add febe0000 - febe3fff [0x7fcf3654d000]
(memory region re-mapped)
vfio: vfio_pci_read_config(0000:01:10.0, @0x14, len=0x4) 0
(save upper 32bits of BAR)
vfio: vfio_pci_write_config(0000:01:10.0, @0x14, 0xffffffff, len=0x4)
(write mask to BAR)
vfio: region_del febe0000 - febe3fff
(memory region gets unmapped)
vfio: region_add fffffffffebe0000 - fffffffffebe3fff [0x7fcf3654d000]
(memory region gets re-mapped with new address)
qemu-system-x86_64: vfio_dma_map(0x7fcf38861710, 0xfffffffffebe0000, 0x4000, 0x7fcf3654d000) = -14 (Bad address)
(iommu barfs because it can only handle 48bit physical addresses)
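
The trace above is the usual 64-bit BAR sizing dance, done with memory
decode still enabled.  As a rough sketch of that sequence (this is not
QEMU's pci core code; cfg_read32()/cfg_write32() and size_64bit_bar()
are hypothetical helper names), the interesting moment is the
high-dword probe, where the BAR transiently decodes at
0xfffffffffebe0000:

    #include <stdint.h>

    extern uint32_t cfg_read32(int bar_off);              /* hypothetical */
    extern void     cfg_write32(int bar_off, uint32_t v); /* hypothetical */

    static uint64_t size_64bit_bar(int bar_off)
    {
        uint32_t lo_saved = cfg_read32(bar_off);           /* febe0004 */
        uint32_t hi_saved = cfg_read32(bar_off + 4);       /* 0 */

        /* probe the low dword: write all-ones, read back the size mask */
        cfg_write32(bar_off, 0xffffffff);
        uint32_t lo_mask = cfg_read32(bar_off);            /* ffffc004 */
        cfg_write32(bar_off, lo_saved);                    /* restore low half */

        /*
         * Probe the high dword.  With 0xffffffff in the upper half and the
         * original low half restored, the BAR now decodes at
         * 0xfffffffffebe0000, which is exactly the region_add seen above,
         * because memory decode was never turned off in the command register.
         */
        cfg_write32(bar_off + 4, 0xffffffff);
        uint32_t hi_mask = cfg_read32(bar_off + 4);
        cfg_write32(bar_off + 4, hi_saved);                /* restore high half */

        /* clear the low type bits, two's complement gives the size */
        uint64_t mask = ((uint64_t)hi_mask << 32) | (lo_mask & ~0xfULL);
        return ~mask + 1;                                  /* 0x4000 here */
    }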

Prior to this change, there was no re-map with the fffffffffebe0000
address, presumably because it was beyond the address space of the PCI
window.  This address is clearly not in a PCI MMIO space, so why are we
allowing it to be realized in the system address space at this location?
Thanks,

Alex
Michael S. Tsirkin Jan. 9, 2014, 6 p.m. UTC | #2
On Thu, Jan 09, 2014 at 10:24:47AM -0700, Alex Williamson wrote:
> On Wed, 2013-12-11 at 20:30 +0200, Michael S. Tsirkin wrote:
> > From: Paolo Bonzini <pbonzini@redhat.com>
> > 
> > As an alternative to commit 818f86b (exec: limit system memory
> > size, 2013-11-04) let's just make all address spaces 64-bit wide.
> > This eliminates problems with phys_page_find ignoring bits above
> > TARGET_PHYS_ADDR_SPACE_BITS and address_space_translate_internal
> > consequently messing up the computations.
> > 
> > In Luiz's reported crash, at startup gdb attempts to read from address
> > 0xffffffffffffffe6 to 0xffffffffffffffff inclusive.  The region it gets
> > is the newly introduced master abort region, which is as big as the PCI
> > address space (see pci_bus_init).  Due to a typo that's only 2^63-1,
> > not 2^64.  But we get it anyway because phys_page_find ignores the upper
> > bits of the physical address.  In address_space_translate_internal then
> > 
> >     diff = int128_sub(section->mr->size, int128_make64(addr));
> >     *plen = int128_get64(int128_min(diff, int128_make64(*plen)));
> > 
> > diff becomes negative, and int128_get64 booms.
> > 
> > The size of the PCI address space region should be fixed anyway.
> > 
> > Reported-by: Luiz Capitulino <lcapitulino@redhat.com>
> > Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> > Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> > ---
> >  exec.c | 8 ++------
> >  1 file changed, 2 insertions(+), 6 deletions(-)
> > 
> > diff --git a/exec.c b/exec.c
> > index 7e5ce93..f907f5f 100644
> > --- a/exec.c
> > +++ b/exec.c
> > @@ -94,7 +94,7 @@ struct PhysPageEntry {
> >  #define PHYS_MAP_NODE_NIL (((uint32_t)~0) >> 6)
> >  
> >  /* Size of the L2 (and L3, etc) page tables.  */
> > -#define ADDR_SPACE_BITS TARGET_PHYS_ADDR_SPACE_BITS
> > +#define ADDR_SPACE_BITS 64
> >  
> >  #define P_L2_BITS 10
> >  #define P_L2_SIZE (1 << P_L2_BITS)
> > @@ -1861,11 +1861,7 @@ static void memory_map_init(void)
> >  {
> >      system_memory = g_malloc(sizeof(*system_memory));
> >  
> > -    assert(ADDR_SPACE_BITS <= 64);
> > -
> > -    memory_region_init(system_memory, NULL, "system",
> > -                       ADDR_SPACE_BITS == 64 ?
> > -                       UINT64_MAX : (0x1ULL << ADDR_SPACE_BITS));
> > +    memory_region_init(system_memory, NULL, "system", UINT64_MAX);
> >      address_space_init(&address_space_memory, system_memory, "memory");
> >  
> >      system_io = g_malloc(sizeof(*system_io));
> 
> This seems to have some unexpected consequences around sizing 64bit PCI
> BARs that I'm not sure how to handle.

BARs are often disabled during sizing. Maybe you
don't detect BAR being disabled?
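
"BAR being disabled" here means memory decode turned off in the command
register while the BARs are probed.  A minimal sketch of that check,
with cfg_read16() as a hypothetical config-space accessor (and, as the
follow-ups below show, decode actually stays enabled in this trace):

    #include <stdbool.h>
    #include <stdint.h>

    #define PCI_COMMAND        0x04   /* command register offset */
    #define PCI_COMMAND_MEMORY 0x2    /* memory space decode enable bit */

    extern uint16_t cfg_read16(int off);   /* hypothetical config accessor */

    /* A BAR mapping is only meaningful while memory decode is enabled. */
    static bool memory_decode_enabled(void)
    {
        return cfg_read16(PCI_COMMAND) & PCI_COMMAND_MEMORY;
    }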

>  After this patch I get vfio
> traces like this:
> 
> vfio: vfio_pci_read_config(0000:01:10.0, @0x10, len=0x4) febe0004
> (save lower 32bits of BAR)
> vfio: vfio_pci_write_config(0000:01:10.0, @0x10, 0xffffffff, len=0x4)
> (write mask to BAR)
> vfio: region_del febe0000 - febe3fff
> (memory region gets unmapped)
> vfio: vfio_pci_read_config(0000:01:10.0, @0x10, len=0x4) ffffc004
> (read size mask)
> vfio: vfio_pci_write_config(0000:01:10.0, @0x10, 0xfebe0004, len=0x4)
> (restore BAR)
> vfio: region_add febe0000 - febe3fff [0x7fcf3654d000]
> (memory region re-mapped)
> vfio: vfio_pci_read_config(0000:01:10.0, @0x14, len=0x4) 0
> (save upper 32bits of BAR)
> vfio: vfio_pci_write_config(0000:01:10.0, @0x14, 0xffffffff, len=0x4)
> (write mask to BAR)
> vfio: region_del febe0000 - febe3fff
> (memory region gets unmapped)
> vfio: region_add fffffffffebe0000 - fffffffffebe3fff [0x7fcf3654d000]
> (memory region gets re-mapped with new address)
> qemu-system-x86_64: vfio_dma_map(0x7fcf38861710, 0xfffffffffebe0000, 0x4000, 0x7fcf3654d000) = -14 (Bad address)
> (iommu barfs because it can only handle 48bit physical addresses)
> 

Why are you trying to program BAR addresses for dma in the iommu?

> Prior to this change, there was no re-map with the fffffffffebe0000
> address, presumably because it was beyond the address space of the PCI
> window.  This address is clearly not in a PCI MMIO space, so why are we
> allowing it to be realized in the system address space at this location?
> Thanks,
> 
> Alex

Why do you think it is not in PCI MMIO space?
True, CPU can't access this address but other pci devices can.
Alex Williamson Jan. 9, 2014, 6:47 p.m. UTC | #3
On Thu, 2014-01-09 at 20:00 +0200, Michael S. Tsirkin wrote:
> On Thu, Jan 09, 2014 at 10:24:47AM -0700, Alex Williamson wrote:
> > On Wed, 2013-12-11 at 20:30 +0200, Michael S. Tsirkin wrote:
> > > From: Paolo Bonzini <pbonzini@redhat.com>
> > > 
> > > As an alternative to commit 818f86b (exec: limit system memory
> > > size, 2013-11-04) let's just make all address spaces 64-bit wide.
> > > This eliminates problems with phys_page_find ignoring bits above
> > > TARGET_PHYS_ADDR_SPACE_BITS and address_space_translate_internal
> > > consequently messing up the computations.
> > > 
> > > In Luiz's reported crash, at startup gdb attempts to read from address
> > > 0xffffffffffffffe6 to 0xffffffffffffffff inclusive.  The region it gets
> > > is the newly introduced master abort region, which is as big as the PCI
> > > address space (see pci_bus_init).  Due to a typo that's only 2^63-1,
> > > not 2^64.  But we get it anyway because phys_page_find ignores the upper
> > > bits of the physical address.  In address_space_translate_internal then
> > > 
> > >     diff = int128_sub(section->mr->size, int128_make64(addr));
> > >     *plen = int128_get64(int128_min(diff, int128_make64(*plen)));
> > > 
> > > diff becomes negative, and int128_get64 booms.
> > > 
> > > The size of the PCI address space region should be fixed anyway.
> > > 
> > > Reported-by: Luiz Capitulino <lcapitulino@redhat.com>
> > > Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> > > Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> > > ---
> > >  exec.c | 8 ++------
> > >  1 file changed, 2 insertions(+), 6 deletions(-)
> > > 
> > > diff --git a/exec.c b/exec.c
> > > index 7e5ce93..f907f5f 100644
> > > --- a/exec.c
> > > +++ b/exec.c
> > > @@ -94,7 +94,7 @@ struct PhysPageEntry {
> > >  #define PHYS_MAP_NODE_NIL (((uint32_t)~0) >> 6)
> > >  
> > >  /* Size of the L2 (and L3, etc) page tables.  */
> > > -#define ADDR_SPACE_BITS TARGET_PHYS_ADDR_SPACE_BITS
> > > +#define ADDR_SPACE_BITS 64
> > >  
> > >  #define P_L2_BITS 10
> > >  #define P_L2_SIZE (1 << P_L2_BITS)
> > > @@ -1861,11 +1861,7 @@ static void memory_map_init(void)
> > >  {
> > >      system_memory = g_malloc(sizeof(*system_memory));
> > >  
> > > -    assert(ADDR_SPACE_BITS <= 64);
> > > -
> > > -    memory_region_init(system_memory, NULL, "system",
> > > -                       ADDR_SPACE_BITS == 64 ?
> > > -                       UINT64_MAX : (0x1ULL << ADDR_SPACE_BITS));
> > > +    memory_region_init(system_memory, NULL, "system", UINT64_MAX);
> > >      address_space_init(&address_space_memory, system_memory, "memory");
> > >  
> > >      system_io = g_malloc(sizeof(*system_io));
> > 
> > This seems to have some unexpected consequences around sizing 64bit PCI
> > BARs that I'm not sure how to handle.
> 
> BARs are often disabled during sizing. Maybe you
> don't detect BAR being disabled?

See the trace below, the BARs are not disabled.  QEMU pci-core is doing
the sizing and memory region updates for the BARs; vfio is just a
pass-through here.

> >  After this patch I get vfio
> > traces like this:
> > 
> > vfio: vfio_pci_read_config(0000:01:10.0, @0x10, len=0x4) febe0004
> > (save lower 32bits of BAR)
> > vfio: vfio_pci_write_config(0000:01:10.0, @0x10, 0xffffffff, len=0x4)
> > (write mask to BAR)
> > vfio: region_del febe0000 - febe3fff
> > (memory region gets unmapped)
> > vfio: vfio_pci_read_config(0000:01:10.0, @0x10, len=0x4) ffffc004
> > (read size mask)
> > vfio: vfio_pci_write_config(0000:01:10.0, @0x10, 0xfebe0004, len=0x4)
> > (restore BAR)
> > vfio: region_add febe0000 - febe3fff [0x7fcf3654d000]
> > (memory region re-mapped)
> > vfio: vfio_pci_read_config(0000:01:10.0, @0x14, len=0x4) 0
> > (save upper 32bits of BAR)
> > vfio: vfio_pci_write_config(0000:01:10.0, @0x14, 0xffffffff, len=0x4)
> > (write mask to BAR)
> > vfio: region_del febe0000 - febe3fff
> > (memory region gets unmapped)
> > vfio: region_add fffffffffebe0000 - fffffffffebe3fff [0x7fcf3654d000]
> > (memory region gets re-mapped with new address)
> > qemu-system-x86_64: vfio_dma_map(0x7fcf38861710, 0xfffffffffebe0000, 0x4000, 0x7fcf3654d000) = -14 (Bad address)
> > (iommu barfs because it can only handle 48bit physical addresses)
> > 
> 
> Why are you trying to program BAR addresses for dma in the iommu?

Two reasons, first I can't tell the difference between RAM and MMIO.
Second, it enables peer-to-peer DMA between devices, which is something
that we might be able to take advantage of with GPU passthrough.

> > Prior to this change, there was no re-map with the fffffffffebe0000
> > address, presumably because it was beyond the address space of the PCI
> > window.  This address is clearly not in a PCI MMIO space, so why are we
> > allowing it to be realized in the system address space at this location?
> > Thanks,
> > 
> > Alex
> 
> Why do you think it is not in PCI MMIO space?
> True, CPU can't access this address but other pci devices can.

What happens on real hardware when an address like this is programmed to
a device?  The CPU doesn't have the physical bits to access it.  I have
serious doubts that another PCI device would be able to access it
either.  Maybe in some limited scenario where the devices are on the
same conventional PCI bus.  In the typical case, PCI addresses are
always limited by some kind of aperture, whether that's explicit in
bridge windows or implicit in hardware design (and perhaps made explicit
in ACPI).  Even if I wanted to filter these out as noise in vfio, how
would I do it in a way that still allows real 64bit MMIO to be
programmed?  PCI has this knowledge, I hope.  VFIO doesn't.  Thanks,

Alex
Alex Williamson Jan. 9, 2014, 7:03 p.m. UTC | #4
On Thu, 2014-01-09 at 11:47 -0700, Alex Williamson wrote:
> On Thu, 2014-01-09 at 20:00 +0200, Michael S. Tsirkin wrote:
> > On Thu, Jan 09, 2014 at 10:24:47AM -0700, Alex Williamson wrote:
> > > On Wed, 2013-12-11 at 20:30 +0200, Michael S. Tsirkin wrote:
> > > > From: Paolo Bonzini <pbonzini@redhat.com>
> > > > 
> > > > As an alternative to commit 818f86b (exec: limit system memory
> > > > size, 2013-11-04) let's just make all address spaces 64-bit wide.
> > > > This eliminates problems with phys_page_find ignoring bits above
> > > > TARGET_PHYS_ADDR_SPACE_BITS and address_space_translate_internal
> > > > consequently messing up the computations.
> > > > 
> > > > In Luiz's reported crash, at startup gdb attempts to read from address
> > > > 0xffffffffffffffe6 to 0xffffffffffffffff inclusive.  The region it gets
> > > > is the newly introduced master abort region, which is as big as the PCI
> > > > address space (see pci_bus_init).  Due to a typo that's only 2^63-1,
> > > > not 2^64.  But we get it anyway because phys_page_find ignores the upper
> > > > bits of the physical address.  In address_space_translate_internal then
> > > > 
> > > >     diff = int128_sub(section->mr->size, int128_make64(addr));
> > > >     *plen = int128_get64(int128_min(diff, int128_make64(*plen)));
> > > > 
> > > > diff becomes negative, and int128_get64 booms.
> > > > 
> > > > The size of the PCI address space region should be fixed anyway.
> > > > 
> > > > Reported-by: Luiz Capitulino <lcapitulino@redhat.com>
> > > > Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> > > > Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> > > > ---
> > > >  exec.c | 8 ++------
> > > >  1 file changed, 2 insertions(+), 6 deletions(-)
> > > > 
> > > > diff --git a/exec.c b/exec.c
> > > > index 7e5ce93..f907f5f 100644
> > > > --- a/exec.c
> > > > +++ b/exec.c
> > > > @@ -94,7 +94,7 @@ struct PhysPageEntry {
> > > >  #define PHYS_MAP_NODE_NIL (((uint32_t)~0) >> 6)
> > > >  
> > > >  /* Size of the L2 (and L3, etc) page tables.  */
> > > > -#define ADDR_SPACE_BITS TARGET_PHYS_ADDR_SPACE_BITS
> > > > +#define ADDR_SPACE_BITS 64
> > > >  
> > > >  #define P_L2_BITS 10
> > > >  #define P_L2_SIZE (1 << P_L2_BITS)
> > > > @@ -1861,11 +1861,7 @@ static void memory_map_init(void)
> > > >  {
> > > >      system_memory = g_malloc(sizeof(*system_memory));
> > > >  
> > > > -    assert(ADDR_SPACE_BITS <= 64);
> > > > -
> > > > -    memory_region_init(system_memory, NULL, "system",
> > > > -                       ADDR_SPACE_BITS == 64 ?
> > > > -                       UINT64_MAX : (0x1ULL << ADDR_SPACE_BITS));
> > > > +    memory_region_init(system_memory, NULL, "system", UINT64_MAX);
> > > >      address_space_init(&address_space_memory, system_memory, "memory");
> > > >  
> > > >      system_io = g_malloc(sizeof(*system_io));
> > > 
> > > This seems to have some unexpected consequences around sizing 64bit PCI
> > > BARs that I'm not sure how to handle.
> > 
> > BARs are often disabled during sizing. Maybe you
> > don't detect BAR being disabled?
> 
> See the trace below, the BARs are not disabled.  QEMU pci-core is doing
> the sizing an memory region updates for the BARs, vfio is just a
> pass-through here.

Sorry, not in the trace below, but yes the sizing seems to be happening
while I/O & memory are enabled in the command register.  Thanks,

Alex

> > >  After this patch I get vfio
> > > traces like this:
> > > 
> > > vfio: vfio_pci_read_config(0000:01:10.0, @0x10, len=0x4) febe0004
> > > (save lower 32bits of BAR)
> > > vfio: vfio_pci_write_config(0000:01:10.0, @0x10, 0xffffffff, len=0x4)
> > > (write mask to BAR)
> > > vfio: region_del febe0000 - febe3fff
> > > (memory region gets unmapped)
> > > vfio: vfio_pci_read_config(0000:01:10.0, @0x10, len=0x4) ffffc004
> > > (read size mask)
> > > vfio: vfio_pci_write_config(0000:01:10.0, @0x10, 0xfebe0004, len=0x4)
> > > (restore BAR)
> > > vfio: region_add febe0000 - febe3fff [0x7fcf3654d000]
> > > (memory region re-mapped)
> > > vfio: vfio_pci_read_config(0000:01:10.0, @0x14, len=0x4) 0
> > > (save upper 32bits of BAR)
> > > vfio: vfio_pci_write_config(0000:01:10.0, @0x14, 0xffffffff, len=0x4)
> > > (write mask to BAR)
> > > vfio: region_del febe0000 - febe3fff
> > > (memory region gets unmapped)
> > > vfio: region_add fffffffffebe0000 - fffffffffebe3fff [0x7fcf3654d000]
> > > (memory region gets re-mapped with new address)
> > > qemu-system-x86_64: vfio_dma_map(0x7fcf38861710, 0xfffffffffebe0000, 0x4000, 0x7fcf3654d000) = -14 (Bad address)
> > > (iommu barfs because it can only handle 48bit physical addresses)
> > > 
> > 
> > Why are you trying to program BAR addresses for dma in the iommu?
> 
> Two reasons, first I can't tell the difference between RAM and MMIO.
> Second, it enables peer-to-peer DMA between devices, which is something
> that we might be able to take advantage of with GPU passthrough.
> 
> > > Prior to this change, there was no re-map with the fffffffffebe0000
> > > address, presumably because it was beyond the address space of the PCI
> > > window.  This address is clearly not in a PCI MMIO space, so why are we
> > > allowing it to be realized in the system address space at this location?
> > > Thanks,
> > > 
> > > Alex
> > 
> > Why do you think it is not in PCI MMIO space?
> > True, CPU can't access this address but other pci devices can.
> 
> What happens on real hardware when an address like this is programmed to
> a device?  The CPU doesn't have the physical bits to access it.  I have
> serious doubts that another PCI device would be able to access it
> either.  Maybe in some limited scenario where the devices are on the
> same conventional PCI bus.  In the typical case, PCI addresses are
> always limited by some kind of aperture, whether that's explicit in
> bridge windows or implicit in hardware design (and perhaps made explicit
> in ACPI).  Even if I wanted to filter these out as noise in vfio, how
> would I do it in a way that still allows real 64bit MMIO to be
> programmed.  PCI has this knowledge, I hope.  VFIO doesn't.  Thanks,
> 
> Alex
Michael S. Tsirkin Jan. 9, 2014, 9:56 p.m. UTC | #5
On Thu, Jan 09, 2014 at 12:03:26PM -0700, Alex Williamson wrote:
> On Thu, 2014-01-09 at 11:47 -0700, Alex Williamson wrote:
> > On Thu, 2014-01-09 at 20:00 +0200, Michael S. Tsirkin wrote:
> > > On Thu, Jan 09, 2014 at 10:24:47AM -0700, Alex Williamson wrote:
> > > > On Wed, 2013-12-11 at 20:30 +0200, Michael S. Tsirkin wrote:
> > > > > From: Paolo Bonzini <pbonzini@redhat.com>
> > > > > 
> > > > > As an alternative to commit 818f86b (exec: limit system memory
> > > > > size, 2013-11-04) let's just make all address spaces 64-bit wide.
> > > > > This eliminates problems with phys_page_find ignoring bits above
> > > > > TARGET_PHYS_ADDR_SPACE_BITS and address_space_translate_internal
> > > > > consequently messing up the computations.
> > > > > 
> > > > > In Luiz's reported crash, at startup gdb attempts to read from address
> > > > > 0xffffffffffffffe6 to 0xffffffffffffffff inclusive.  The region it gets
> > > > > is the newly introduced master abort region, which is as big as the PCI
> > > > > address space (see pci_bus_init).  Due to a typo that's only 2^63-1,
> > > > > not 2^64.  But we get it anyway because phys_page_find ignores the upper
> > > > > bits of the physical address.  In address_space_translate_internal then
> > > > > 
> > > > >     diff = int128_sub(section->mr->size, int128_make64(addr));
> > > > >     *plen = int128_get64(int128_min(diff, int128_make64(*plen)));
> > > > > 
> > > > > diff becomes negative, and int128_get64 booms.
> > > > > 
> > > > > The size of the PCI address space region should be fixed anyway.
> > > > > 
> > > > > Reported-by: Luiz Capitulino <lcapitulino@redhat.com>
> > > > > Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> > > > > Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> > > > > ---
> > > > >  exec.c | 8 ++------
> > > > >  1 file changed, 2 insertions(+), 6 deletions(-)
> > > > > 
> > > > > diff --git a/exec.c b/exec.c
> > > > > index 7e5ce93..f907f5f 100644
> > > > > --- a/exec.c
> > > > > +++ b/exec.c
> > > > > @@ -94,7 +94,7 @@ struct PhysPageEntry {
> > > > >  #define PHYS_MAP_NODE_NIL (((uint32_t)~0) >> 6)
> > > > >  
> > > > >  /* Size of the L2 (and L3, etc) page tables.  */
> > > > > -#define ADDR_SPACE_BITS TARGET_PHYS_ADDR_SPACE_BITS
> > > > > +#define ADDR_SPACE_BITS 64
> > > > >  
> > > > >  #define P_L2_BITS 10
> > > > >  #define P_L2_SIZE (1 << P_L2_BITS)
> > > > > @@ -1861,11 +1861,7 @@ static void memory_map_init(void)
> > > > >  {
> > > > >      system_memory = g_malloc(sizeof(*system_memory));
> > > > >  
> > > > > -    assert(ADDR_SPACE_BITS <= 64);
> > > > > -
> > > > > -    memory_region_init(system_memory, NULL, "system",
> > > > > -                       ADDR_SPACE_BITS == 64 ?
> > > > > -                       UINT64_MAX : (0x1ULL << ADDR_SPACE_BITS));
> > > > > +    memory_region_init(system_memory, NULL, "system", UINT64_MAX);
> > > > >      address_space_init(&address_space_memory, system_memory, "memory");
> > > > >  
> > > > >      system_io = g_malloc(sizeof(*system_io));
> > > > 
> > > > This seems to have some unexpected consequences around sizing 64bit PCI
> > > > BARs that I'm not sure how to handle.
> > > 
> > > BARs are often disabled during sizing. Maybe you
> > > don't detect BAR being disabled?
> > 
> > See the trace below, the BARs are not disabled.  QEMU pci-core is doing
> > the sizing and memory region updates for the BARs; vfio is just a
> > pass-through here.
> 
> Sorry, not in the trace below, but yes the sizing seems to be happening
> while I/O & memory are enabled in the command register.  Thanks,
> 
> Alex

OK then from QEMU POV this BAR value is not special at all.

> > > >  After this patch I get vfio
> > > > traces like this:
> > > > 
> > > > vfio: vfio_pci_read_config(0000:01:10.0, @0x10, len=0x4) febe0004
> > > > (save lower 32bits of BAR)
> > > > vfio: vfio_pci_write_config(0000:01:10.0, @0x10, 0xffffffff, len=0x4)
> > > > (write mask to BAR)
> > > > vfio: region_del febe0000 - febe3fff
> > > > (memory region gets unmapped)
> > > > vfio: vfio_pci_read_config(0000:01:10.0, @0x10, len=0x4) ffffc004
> > > > (read size mask)
> > > > vfio: vfio_pci_write_config(0000:01:10.0, @0x10, 0xfebe0004, len=0x4)
> > > > (restore BAR)
> > > > vfio: region_add febe0000 - febe3fff [0x7fcf3654d000]
> > > > (memory region re-mapped)
> > > > vfio: vfio_pci_read_config(0000:01:10.0, @0x14, len=0x4) 0
> > > > (save upper 32bits of BAR)
> > > > vfio: vfio_pci_write_config(0000:01:10.0, @0x14, 0xffffffff, len=0x4)
> > > > (write mask to BAR)
> > > > vfio: region_del febe0000 - febe3fff
> > > > (memory region gets unmapped)
> > > > vfio: region_add fffffffffebe0000 - fffffffffebe3fff [0x7fcf3654d000]
> > > > (memory region gets re-mapped with new address)
> > > > qemu-system-x86_64: vfio_dma_map(0x7fcf38861710, 0xfffffffffebe0000, 0x4000, 0x7fcf3654d000) = -14 (Bad address)
> > > > (iommu barfs because it can only handle 48bit physical addresses)
> > > > 
> > > 
> > > Why are you trying to program BAR addresses for dma in the iommu?
> > 
> > Two reasons, first I can't tell the difference between RAM and MMIO.

Why can't you? Generally the memory core lets you find out easily.
But in this case it's the vfio device itself that is being sized, so
you know for sure it's MMIO.
Maybe you will have the same issue if there's another device with a
64 bit BAR though, like ivshmem?

> > Second, it enables peer-to-peer DMA between devices, which is something
> > that we might be able to take advantage of with GPU passthrough.
> > 
> > > > Prior to this change, there was no re-map with the fffffffffebe0000
> > > > address, presumably because it was beyond the address space of the PCI
> > > > window.  This address is clearly not in a PCI MMIO space, so why are we
> > > > allowing it to be realized in the system address space at this location?
> > > > Thanks,
> > > > 
> > > > Alex
> > > 
> > > Why do you think it is not in PCI MMIO space?
> > > True, CPU can't access this address but other pci devices can.
> > 
> > What happens on real hardware when an address like this is programmed to
> > a device?  The CPU doesn't have the physical bits to access it.  I have
> > serious doubts that another PCI device would be able to access it
> > either.  Maybe in some limited scenario where the devices are on the
> > same conventional PCI bus.  In the typical case, PCI addresses are
> > always limited by some kind of aperture, whether that's explicit in
> > bridge windows or implicit in hardware design (and perhaps made explicit
> > in ACPI).  Even if I wanted to filter these out as noise in vfio, how
> > would I do it in a way that still allows real 64bit MMIO to be
> > programmed.  PCI has this knowledge, I hope.  VFIO doesn't.  Thanks,
> > 
> > Alex
> 

AFAIK PCI doesn't have that knowledge as such. The PCI spec is explicit
that full 64 bit addresses must be allowed, and hardware validation
test suites normally check that it actually does work
if it happens.

Yes, if there's a bridge somewhere on the path, that bridge's
windows would protect you, but pci already does this filtering:
if you see this address in the memory map, this means
your virtual device is on the root bus.

So I think it's the other way around: if VFIO requires specific
address ranges to be assigned to devices, it should give this
info to qemu and qemu can give this to the guest.
Then anything outside that range can be ignored by VFIO.
Alex Williamson Jan. 9, 2014, 10:42 p.m. UTC | #6
On Thu, 2014-01-09 at 23:56 +0200, Michael S. Tsirkin wrote:
> On Thu, Jan 09, 2014 at 12:03:26PM -0700, Alex Williamson wrote:
> > On Thu, 2014-01-09 at 11:47 -0700, Alex Williamson wrote:
> > > On Thu, 2014-01-09 at 20:00 +0200, Michael S. Tsirkin wrote:
> > > > On Thu, Jan 09, 2014 at 10:24:47AM -0700, Alex Williamson wrote:
> > > > > On Wed, 2013-12-11 at 20:30 +0200, Michael S. Tsirkin wrote:
> > > > > > From: Paolo Bonzini <pbonzini@redhat.com>
> > > > > > 
> > > > > > As an alternative to commit 818f86b (exec: limit system memory
> > > > > > size, 2013-11-04) let's just make all address spaces 64-bit wide.
> > > > > > This eliminates problems with phys_page_find ignoring bits above
> > > > > > TARGET_PHYS_ADDR_SPACE_BITS and address_space_translate_internal
> > > > > > consequently messing up the computations.
> > > > > > 
> > > > > > In Luiz's reported crash, at startup gdb attempts to read from address
> > > > > > 0xffffffffffffffe6 to 0xffffffffffffffff inclusive.  The region it gets
> > > > > > is the newly introduced master abort region, which is as big as the PCI
> > > > > > address space (see pci_bus_init).  Due to a typo that's only 2^63-1,
> > > > > > not 2^64.  But we get it anyway because phys_page_find ignores the upper
> > > > > > bits of the physical address.  In address_space_translate_internal then
> > > > > > 
> > > > > >     diff = int128_sub(section->mr->size, int128_make64(addr));
> > > > > >     *plen = int128_get64(int128_min(diff, int128_make64(*plen)));
> > > > > > 
> > > > > > diff becomes negative, and int128_get64 booms.
> > > > > > 
> > > > > > The size of the PCI address space region should be fixed anyway.
> > > > > > 
> > > > > > Reported-by: Luiz Capitulino <lcapitulino@redhat.com>
> > > > > > Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> > > > > > Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> > > > > > ---
> > > > > >  exec.c | 8 ++------
> > > > > >  1 file changed, 2 insertions(+), 6 deletions(-)
> > > > > > 
> > > > > > diff --git a/exec.c b/exec.c
> > > > > > index 7e5ce93..f907f5f 100644
> > > > > > --- a/exec.c
> > > > > > +++ b/exec.c
> > > > > > @@ -94,7 +94,7 @@ struct PhysPageEntry {
> > > > > >  #define PHYS_MAP_NODE_NIL (((uint32_t)~0) >> 6)
> > > > > >  
> > > > > >  /* Size of the L2 (and L3, etc) page tables.  */
> > > > > > -#define ADDR_SPACE_BITS TARGET_PHYS_ADDR_SPACE_BITS
> > > > > > +#define ADDR_SPACE_BITS 64
> > > > > >  
> > > > > >  #define P_L2_BITS 10
> > > > > >  #define P_L2_SIZE (1 << P_L2_BITS)
> > > > > > @@ -1861,11 +1861,7 @@ static void memory_map_init(void)
> > > > > >  {
> > > > > >      system_memory = g_malloc(sizeof(*system_memory));
> > > > > >  
> > > > > > -    assert(ADDR_SPACE_BITS <= 64);
> > > > > > -
> > > > > > -    memory_region_init(system_memory, NULL, "system",
> > > > > > -                       ADDR_SPACE_BITS == 64 ?
> > > > > > -                       UINT64_MAX : (0x1ULL << ADDR_SPACE_BITS));
> > > > > > +    memory_region_init(system_memory, NULL, "system", UINT64_MAX);
> > > > > >      address_space_init(&address_space_memory, system_memory, "memory");
> > > > > >  
> > > > > >      system_io = g_malloc(sizeof(*system_io));
> > > > > 
> > > > > This seems to have some unexpected consequences around sizing 64bit PCI
> > > > > BARs that I'm not sure how to handle.
> > > > 
> > > > BARs are often disabled during sizing. Maybe you
> > > > don't detect BAR being disabled?
> > > 
> > > See the trace below, the BARs are not disabled.  QEMU pci-core is doing
> > > the sizing and memory region updates for the BARs; vfio is just a
> > > pass-through here.
> > 
> > Sorry, not in the trace below, but yes the sizing seems to be happening
> > while I/O & memory are enabled in the command register.  Thanks,
> > 
> > Alex
> 
> OK then from QEMU POV this BAR value is not special at all.

Unfortunately

> > > > >  After this patch I get vfio
> > > > > traces like this:
> > > > > 
> > > > > vfio: vfio_pci_read_config(0000:01:10.0, @0x10, len=0x4) febe0004
> > > > > (save lower 32bits of BAR)
> > > > > vfio: vfio_pci_write_config(0000:01:10.0, @0x10, 0xffffffff, len=0x4)
> > > > > (write mask to BAR)
> > > > > vfio: region_del febe0000 - febe3fff
> > > > > (memory region gets unmapped)
> > > > > vfio: vfio_pci_read_config(0000:01:10.0, @0x10, len=0x4) ffffc004
> > > > > (read size mask)
> > > > > vfio: vfio_pci_write_config(0000:01:10.0, @0x10, 0xfebe0004, len=0x4)
> > > > > (restore BAR)
> > > > > vfio: region_add febe0000 - febe3fff [0x7fcf3654d000]
> > > > > (memory region re-mapped)
> > > > > vfio: vfio_pci_read_config(0000:01:10.0, @0x14, len=0x4) 0
> > > > > (save upper 32bits of BAR)
> > > > > vfio: vfio_pci_write_config(0000:01:10.0, @0x14, 0xffffffff, len=0x4)
> > > > > (write mask to BAR)
> > > > > vfio: region_del febe0000 - febe3fff
> > > > > (memory region gets unmapped)
> > > > > vfio: region_add fffffffffebe0000 - fffffffffebe3fff [0x7fcf3654d000]
> > > > > (memory region gets re-mapped with new address)
> > > > > qemu-system-x86_64: vfio_dma_map(0x7fcf38861710, 0xfffffffffebe0000, 0x4000, 0x7fcf3654d000) = -14 (Bad address)
> > > > > (iommu barfs because it can only handle 48bit physical addresses)
> > > > > 
> > > > 
> > > > Why are you trying to program BAR addresses for dma in the iommu?
> > > 
> > > Two reasons, first I can't tell the difference between RAM and MMIO.
> 
> Why can't you? Generally memory core let you find out easily.

My MemoryListener is set up for &address_space_memory and I then filter
out anything that's not memory_region_is_ram().  This still gets
through, so how do I easily find out?
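
The filter being described has roughly the following shape.  This is a
sketch, not vfio's actual listener code, and map_into_iommu() is a
made-up placeholder; the catch is that an mmap'd BAR registered with
memory_region_init_ram_ptr() also reports memory_region_is_ram(), so it
is not filtered out:

    #include "exec/memory.h"    /* QEMU memory API */

    extern void map_into_iommu(MemoryRegionSection *section);  /* placeholder */

    static void listener_region_add(MemoryListener *listener,
                                    MemoryRegionSection *section)
    {
        /* intended filter: only program guest RAM into the IOMMU */
        if (!memory_region_is_ram(section->mr)) {
            return;
        }

        /*
         * mmap'd MMIO BARs are backed by host memory and count as "ram"
         * here, so the transient fffffffffebe0000 section from the BAR
         * sizing still reaches the (failing) vfio_dma_map() call.
         */
        map_into_iommu(section);
    }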

> But in this case it's vfio device itself that is sized so for sure you
> know it's MMIO.

How so?  I have a MemoryListener as described above and pass everything
through to the IOMMU.  I suppose I could look through all the
VFIODevices and check if the MemoryRegion matches, but that seems really
ugly.

> Maybe you will have same issue if there's another device with a 64 bit
> bar though, like ivshmem?

Perhaps; I suspect I'll see anything that registers its BAR
MemoryRegion from memory_region_init_ram or memory_region_init_ram_ptr.

> > > Second, it enables peer-to-peer DMA between devices, which is something
> > > that we might be able to take advantage of with GPU passthrough.
> > > 
> > > > > Prior to this change, there was no re-map with the fffffffffebe0000
> > > > > address, presumably because it was beyond the address space of the PCI
> > > > > window.  This address is clearly not in a PCI MMIO space, so why are we
> > > > > allowing it to be realized in the system address space at this location?
> > > > > Thanks,
> > > > > 
> > > > > Alex
> > > > 
> > > > Why do you think it is not in PCI MMIO space?
> > > > True, CPU can't access this address but other pci devices can.
> > > 
> > > What happens on real hardware when an address like this is programmed to
> > > a device?  The CPU doesn't have the physical bits to access it.  I have
> > > serious doubts that another PCI device would be able to access it
> > > either.  Maybe in some limited scenario where the devices are on the
> > > same conventional PCI bus.  In the typical case, PCI addresses are
> > > always limited by some kind of aperture, whether that's explicit in
> > > bridge windows or implicit in hardware design (and perhaps made explicit
> > > in ACPI).  Even if I wanted to filter these out as noise in vfio, how
> > > would I do it in a way that still allows real 64bit MMIO to be
> > > programmed.  PCI has this knowledge, I hope.  VFIO doesn't.  Thanks,
> > > 
> > > Alex
> > 
> 
> AFAIK PCI doesn't have that knowledge as such. PCI spec is explicit that
> full 64 bit addresses must be allowed and hardware validation
> test suites normally check that it actually does work
> if it happens.

Sure, PCI devices themselves, but the chipset typically has defined
routing; that's more what I'm referring to.  There are generally only
fixed address windows for RAM vs MMIO.

> Yes, if there's a bridge somewhere on the path that bridge's
> windows would protect you, but pci already does this filtering:
> if you see this address in the memory map this means
> your virtual device is on root bus.
> 
> So I think it's the other way around: if VFIO requires specific
> address ranges to be assigned to devices, it should give this
> info to qemu and qemu can give this to guest.
> Then anything outside that range can be ignored by VFIO.

Then we get into deficiencies in the IOMMU API and maybe VFIO.  There's
currently no way to find out the address width of the IOMMU.  We've been
getting by because it's safely close enough to the CPU address width to
not be a concern until we start exposing things at the top of the 64bit
address space.  Maybe I can safely ignore anything above
TARGET_PHYS_ADDR_SPACE_BITS for now.  Thanks,

Alex
Michael S. Tsirkin Jan. 10, 2014, 12:55 p.m. UTC | #7
On Thu, Jan 09, 2014 at 03:42:22PM -0700, Alex Williamson wrote:
> On Thu, 2014-01-09 at 23:56 +0200, Michael S. Tsirkin wrote:
> > On Thu, Jan 09, 2014 at 12:03:26PM -0700, Alex Williamson wrote:
> > > On Thu, 2014-01-09 at 11:47 -0700, Alex Williamson wrote:
> > > > On Thu, 2014-01-09 at 20:00 +0200, Michael S. Tsirkin wrote:
> > > > > On Thu, Jan 09, 2014 at 10:24:47AM -0700, Alex Williamson wrote:
> > > > > > On Wed, 2013-12-11 at 20:30 +0200, Michael S. Tsirkin wrote:
> > > > > > > From: Paolo Bonzini <pbonzini@redhat.com>
> > > > > > > 
> > > > > > > As an alternative to commit 818f86b (exec: limit system memory
> > > > > > > size, 2013-11-04) let's just make all address spaces 64-bit wide.
> > > > > > > This eliminates problems with phys_page_find ignoring bits above
> > > > > > > TARGET_PHYS_ADDR_SPACE_BITS and address_space_translate_internal
> > > > > > > consequently messing up the computations.
> > > > > > > 
> > > > > > > In Luiz's reported crash, at startup gdb attempts to read from address
> > > > > > > 0xffffffffffffffe6 to 0xffffffffffffffff inclusive.  The region it gets
> > > > > > > is the newly introduced master abort region, which is as big as the PCI
> > > > > > > address space (see pci_bus_init).  Due to a typo that's only 2^63-1,
> > > > > > > not 2^64.  But we get it anyway because phys_page_find ignores the upper
> > > > > > > bits of the physical address.  In address_space_translate_internal then
> > > > > > > 
> > > > > > >     diff = int128_sub(section->mr->size, int128_make64(addr));
> > > > > > >     *plen = int128_get64(int128_min(diff, int128_make64(*plen)));
> > > > > > > 
> > > > > > > diff becomes negative, and int128_get64 booms.
> > > > > > > 
> > > > > > > The size of the PCI address space region should be fixed anyway.
> > > > > > > 
> > > > > > > Reported-by: Luiz Capitulino <lcapitulino@redhat.com>
> > > > > > > Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> > > > > > > Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> > > > > > > ---
> > > > > > >  exec.c | 8 ++------
> > > > > > >  1 file changed, 2 insertions(+), 6 deletions(-)
> > > > > > > 
> > > > > > > diff --git a/exec.c b/exec.c
> > > > > > > index 7e5ce93..f907f5f 100644
> > > > > > > --- a/exec.c
> > > > > > > +++ b/exec.c
> > > > > > > @@ -94,7 +94,7 @@ struct PhysPageEntry {
> > > > > > >  #define PHYS_MAP_NODE_NIL (((uint32_t)~0) >> 6)
> > > > > > >  
> > > > > > >  /* Size of the L2 (and L3, etc) page tables.  */
> > > > > > > -#define ADDR_SPACE_BITS TARGET_PHYS_ADDR_SPACE_BITS
> > > > > > > +#define ADDR_SPACE_BITS 64
> > > > > > >  
> > > > > > >  #define P_L2_BITS 10
> > > > > > >  #define P_L2_SIZE (1 << P_L2_BITS)
> > > > > > > @@ -1861,11 +1861,7 @@ static void memory_map_init(void)
> > > > > > >  {
> > > > > > >      system_memory = g_malloc(sizeof(*system_memory));
> > > > > > >  
> > > > > > > -    assert(ADDR_SPACE_BITS <= 64);
> > > > > > > -
> > > > > > > -    memory_region_init(system_memory, NULL, "system",
> > > > > > > -                       ADDR_SPACE_BITS == 64 ?
> > > > > > > -                       UINT64_MAX : (0x1ULL << ADDR_SPACE_BITS));
> > > > > > > +    memory_region_init(system_memory, NULL, "system", UINT64_MAX);
> > > > > > >      address_space_init(&address_space_memory, system_memory, "memory");
> > > > > > >  
> > > > > > >      system_io = g_malloc(sizeof(*system_io));
> > > > > > 
> > > > > > This seems to have some unexpected consequences around sizing 64bit PCI
> > > > > > BARs that I'm not sure how to handle.
> > > > > 
> > > > > BARs are often disabled during sizing. Maybe you
> > > > > don't detect BAR being disabled?
> > > > 
> > > > See the trace below, the BARs are not disabled.  QEMU pci-core is doing
> > > > the sizing and memory region updates for the BARs; vfio is just a
> > > > pass-through here.
> > > 
> > > Sorry, not in the trace below, but yes the sizing seems to be happening
> > > while I/O & memory are enabled in the command register.  Thanks,
> > > 
> > > Alex
> > 
> > OK then from QEMU POV this BAR value is not special at all.
> 
> Unfortunately
> 
> > > > > >  After this patch I get vfio
> > > > > > traces like this:
> > > > > > 
> > > > > > vfio: vfio_pci_read_config(0000:01:10.0, @0x10, len=0x4) febe0004
> > > > > > (save lower 32bits of BAR)
> > > > > > vfio: vfio_pci_write_config(0000:01:10.0, @0x10, 0xffffffff, len=0x4)
> > > > > > (write mask to BAR)
> > > > > > vfio: region_del febe0000 - febe3fff
> > > > > > (memory region gets unmapped)
> > > > > > vfio: vfio_pci_read_config(0000:01:10.0, @0x10, len=0x4) ffffc004
> > > > > > (read size mask)
> > > > > > vfio: vfio_pci_write_config(0000:01:10.0, @0x10, 0xfebe0004, len=0x4)
> > > > > > (restore BAR)
> > > > > > vfio: region_add febe0000 - febe3fff [0x7fcf3654d000]
> > > > > > (memory region re-mapped)
> > > > > > vfio: vfio_pci_read_config(0000:01:10.0, @0x14, len=0x4) 0
> > > > > > (save upper 32bits of BAR)
> > > > > > vfio: vfio_pci_write_config(0000:01:10.0, @0x14, 0xffffffff, len=0x4)
> > > > > > (write mask to BAR)
> > > > > > vfio: region_del febe0000 - febe3fff
> > > > > > (memory region gets unmapped)
> > > > > > vfio: region_add fffffffffebe0000 - fffffffffebe3fff [0x7fcf3654d000]
> > > > > > (memory region gets re-mapped with new address)
> > > > > > qemu-system-x86_64: vfio_dma_map(0x7fcf38861710, 0xfffffffffebe0000, 0x4000, 0x7fcf3654d000) = -14 (Bad address)
> > > > > > (iommu barfs because it can only handle 48bit physical addresses)
> > > > > > 
> > > > > 
> > > > > Why are you trying to program BAR addresses for dma in the iommu?
> > > > 
> > > > Two reasons, first I can't tell the difference between RAM and MMIO.
> > 
> > Why can't you? Generally memory core let you find out easily.
> 
> My MemoryListener is setup for &address_space_memory and I then filter
> out anything that's not memory_region_is_ram().  This still gets
> through, so how do I easily find out?
> 
> > But in this case it's vfio device itself that is sized so for sure you
> > know it's MMIO.
> 
> How so?  I have a MemoryListener as described above and pass everything
> through to the IOMMU.  I suppose I could look through all the
> VFIODevices and check if the MemoryRegion matches, but that seems really
> ugly.
> 
> > Maybe you will have same issue if there's another device with a 64 bit
> > bar though, like ivshmem?
> 
> Perhaps, I suspect I'll see anything that registers their BAR
> MemoryRegion from memory_region_init_ram or memory_region_init_ram_ptr.

Must be a 64 bit BAR to trigger the issue though.

> > > > Second, it enables peer-to-peer DMA between devices, which is something
> > > > that we might be able to take advantage of with GPU passthrough.
> > > > 
> > > > > > Prior to this change, there was no re-map with the fffffffffebe0000
> > > > > > address, presumably because it was beyond the address space of the PCI
> > > > > > window.  This address is clearly not in a PCI MMIO space, so why are we
> > > > > > allowing it to be realized in the system address space at this location?
> > > > > > Thanks,
> > > > > > 
> > > > > > Alex
> > > > > 
> > > > > Why do you think it is not in PCI MMIO space?
> > > > > True, CPU can't access this address but other pci devices can.
> > > > 
> > > > What happens on real hardware when an address like this is programmed to
> > > > a device?  The CPU doesn't have the physical bits to access it.  I have
> > > > serious doubts that another PCI device would be able to access it
> > > > either.  Maybe in some limited scenario where the devices are on the
> > > > same conventional PCI bus.  In the typical case, PCI addresses are
> > > > always limited by some kind of aperture, whether that's explicit in
> > > > bridge windows or implicit in hardware design (and perhaps made explicit
> > > > in ACPI).  Even if I wanted to filter these out as noise in vfio, how
> > > > would I do it in a way that still allows real 64bit MMIO to be
> > > > programmed.  PCI has this knowledge, I hope.  VFIO doesn't.  Thanks,
> > > > 
> > > > Alex
> > > 
> > 
> > AFAIK PCI doesn't have that knowledge as such. PCI spec is explicit that
> > full 64 bit addresses must be allowed and hardware validation
> > test suites normally check that it actually does work
> > if it happens.
> 
> Sure, PCI devices themselves, but the chipset typically has defined
> routing, that's more what I'm referring to.  There are generally only
> fixed address windows for RAM vs MMIO.

The physical chipset? Likely - in the presence of an IOMMU.
Without that, devices can talk to each other without going
through the chipset, and the bridge spec is very explicit that
full 64 bit addressing must be supported.

So as long as we don't emulate an IOMMU,
the guest will normally think it's okay to use any address.

> > Yes, if there's a bridge somewhere on the path that bridge's
> > windows would protect you, but pci already does this filtering:
> > if you see this address in the memory map this means
> > your virtual device is on root bus.
> > 
> > So I think it's the other way around: if VFIO requires specific
> > address ranges to be assigned to devices, it should give this
> > info to qemu and qemu can give this to guest.
> > Then anything outside that range can be ignored by VFIO.
> 
> Then we get into deficiencies in the IOMMU API and maybe VFIO.  There's
> currently no way to find out the address width of the IOMMU.  We've been
> getting by because it's safely close enough to the CPU address width to
> not be a concern until we start exposing things at the top of the 64bit
> address space.  Maybe I can safely ignore anything above
> TARGET_PHYS_ADDR_SPACE_BITS for now.  Thanks,
> 
> Alex

I think it's not related to the target CPU at all - it's a host
limitation.  So just make up your own constant, maybe depending on the
host architecture.  Long term, add an ioctl to query it.

Also, we can add a fwcfg interface to tell the BIOS that it should
avoid placing BARs above some address.

Since it's a vfio limitation I think it should be a vfio API, along the
lines of vfio_get_addr_space_bits(void).
(Is this true btw? Legacy assignment doesn't have this problem?)

Does something like this make sense to you?
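
As an illustration of that proposal (the interface does not exist at
this point; vfio_get_addr_space_bits() and the 48-bit figure are
assumptions made up for the sketch), the check on the QEMU side could
look something like this:

    #include <stdbool.h>
    #include <stdint.h>

    /*
     * Hypothetical helper: how many IOVA bits the host IOMMU can map.
     * A real version would query the vfio driver (e.g. via an ioctl);
     * 48 is just a typical value used here for illustration.
     */
    static unsigned int vfio_get_addr_space_bits(void)
    {
        return 48;
    }

    /* Skip DMA mappings the host IOMMU could never handle anyway. */
    static bool iova_is_mappable(uint64_t iova, uint64_t size)
    {
        uint64_t limit = 1ULL << vfio_get_addr_space_bits();

        return iova < limit && size <= limit - iova;
    }

With such a limit, the transient fffffffffebe0000 mapping from BAR
sizing would simply be skipped instead of failing in vfio_dma_map();
the short-term plan mentioned below, ignoring anything above
TARGET_PHYS_ADDR_SPACE_BITS, is the same check with a compile-time
constant.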
Alex Williamson Jan. 10, 2014, 3:31 p.m. UTC | #8
On Fri, 2014-01-10 at 14:55 +0200, Michael S. Tsirkin wrote:
> On Thu, Jan 09, 2014 at 03:42:22PM -0700, Alex Williamson wrote:
> > On Thu, 2014-01-09 at 23:56 +0200, Michael S. Tsirkin wrote:
> > > On Thu, Jan 09, 2014 at 12:03:26PM -0700, Alex Williamson wrote:
> > > > On Thu, 2014-01-09 at 11:47 -0700, Alex Williamson wrote:
> > > > > On Thu, 2014-01-09 at 20:00 +0200, Michael S. Tsirkin wrote:
> > > > > > On Thu, Jan 09, 2014 at 10:24:47AM -0700, Alex Williamson wrote:
> > > > > > > On Wed, 2013-12-11 at 20:30 +0200, Michael S. Tsirkin wrote:
> > > > > > > > From: Paolo Bonzini <pbonzini@redhat.com>
> > > > > > > > 
> > > > > > > > As an alternative to commit 818f86b (exec: limit system memory
> > > > > > > > size, 2013-11-04) let's just make all address spaces 64-bit wide.
> > > > > > > > This eliminates problems with phys_page_find ignoring bits above
> > > > > > > > TARGET_PHYS_ADDR_SPACE_BITS and address_space_translate_internal
> > > > > > > > consequently messing up the computations.
> > > > > > > > 
> > > > > > > > In Luiz's reported crash, at startup gdb attempts to read from address
> > > > > > > > 0xffffffffffffffe6 to 0xffffffffffffffff inclusive.  The region it gets
> > > > > > > > is the newly introduced master abort region, which is as big as the PCI
> > > > > > > > address space (see pci_bus_init).  Due to a typo that's only 2^63-1,
> > > > > > > > not 2^64.  But we get it anyway because phys_page_find ignores the upper
> > > > > > > > bits of the physical address.  In address_space_translate_internal then
> > > > > > > > 
> > > > > > > >     diff = int128_sub(section->mr->size, int128_make64(addr));
> > > > > > > >     *plen = int128_get64(int128_min(diff, int128_make64(*plen)));
> > > > > > > > 
> > > > > > > > diff becomes negative, and int128_get64 booms.
> > > > > > > > 
> > > > > > > > The size of the PCI address space region should be fixed anyway.
> > > > > > > > 
> > > > > > > > Reported-by: Luiz Capitulino <lcapitulino@redhat.com>
> > > > > > > > Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> > > > > > > > Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> > > > > > > > ---
> > > > > > > >  exec.c | 8 ++------
> > > > > > > >  1 file changed, 2 insertions(+), 6 deletions(-)
> > > > > > > > 
> > > > > > > > diff --git a/exec.c b/exec.c
> > > > > > > > index 7e5ce93..f907f5f 100644
> > > > > > > > --- a/exec.c
> > > > > > > > +++ b/exec.c
> > > > > > > > @@ -94,7 +94,7 @@ struct PhysPageEntry {
> > > > > > > >  #define PHYS_MAP_NODE_NIL (((uint32_t)~0) >> 6)
> > > > > > > >  
> > > > > > > >  /* Size of the L2 (and L3, etc) page tables.  */
> > > > > > > > -#define ADDR_SPACE_BITS TARGET_PHYS_ADDR_SPACE_BITS
> > > > > > > > +#define ADDR_SPACE_BITS 64
> > > > > > > >  
> > > > > > > >  #define P_L2_BITS 10
> > > > > > > >  #define P_L2_SIZE (1 << P_L2_BITS)
> > > > > > > > @@ -1861,11 +1861,7 @@ static void memory_map_init(void)
> > > > > > > >  {
> > > > > > > >      system_memory = g_malloc(sizeof(*system_memory));
> > > > > > > >  
> > > > > > > > -    assert(ADDR_SPACE_BITS <= 64);
> > > > > > > > -
> > > > > > > > -    memory_region_init(system_memory, NULL, "system",
> > > > > > > > -                       ADDR_SPACE_BITS == 64 ?
> > > > > > > > -                       UINT64_MAX : (0x1ULL << ADDR_SPACE_BITS));
> > > > > > > > +    memory_region_init(system_memory, NULL, "system", UINT64_MAX);
> > > > > > > >      address_space_init(&address_space_memory, system_memory, "memory");
> > > > > > > >  
> > > > > > > >      system_io = g_malloc(sizeof(*system_io));
> > > > > > > 
> > > > > > > This seems to have some unexpected consequences around sizing 64bit PCI
> > > > > > > BARs that I'm not sure how to handle.
> > > > > > 
> > > > > > BARs are often disabled during sizing. Maybe you
> > > > > > don't detect BAR being disabled?
> > > > > 
> > > > > See the trace below, the BARs are not disabled.  QEMU pci-core is doing
> > > > > the sizing and memory region updates for the BARs; vfio is just a
> > > > > pass-through here.
> > > > 
> > > > Sorry, not in the trace below, but yes the sizing seems to be happening
> > > > while I/O & memory are enabled in the command register.  Thanks,
> > > > 
> > > > Alex
> > > 
> > > OK then from QEMU POV this BAR value is not special at all.
> > 
> > Unfortunately
> > 
> > > > > > >  After this patch I get vfio
> > > > > > > traces like this:
> > > > > > > 
> > > > > > > vfio: vfio_pci_read_config(0000:01:10.0, @0x10, len=0x4) febe0004
> > > > > > > (save lower 32bits of BAR)
> > > > > > > vfio: vfio_pci_write_config(0000:01:10.0, @0x10, 0xffffffff, len=0x4)
> > > > > > > (write mask to BAR)
> > > > > > > vfio: region_del febe0000 - febe3fff
> > > > > > > (memory region gets unmapped)
> > > > > > > vfio: vfio_pci_read_config(0000:01:10.0, @0x10, len=0x4) ffffc004
> > > > > > > (read size mask)
> > > > > > > vfio: vfio_pci_write_config(0000:01:10.0, @0x10, 0xfebe0004, len=0x4)
> > > > > > > (restore BAR)
> > > > > > > vfio: region_add febe0000 - febe3fff [0x7fcf3654d000]
> > > > > > > (memory region re-mapped)
> > > > > > > vfio: vfio_pci_read_config(0000:01:10.0, @0x14, len=0x4) 0
> > > > > > > (save upper 32bits of BAR)
> > > > > > > vfio: vfio_pci_write_config(0000:01:10.0, @0x14, 0xffffffff, len=0x4)
> > > > > > > (write mask to BAR)
> > > > > > > vfio: region_del febe0000 - febe3fff
> > > > > > > (memory region gets unmapped)
> > > > > > > vfio: region_add fffffffffebe0000 - fffffffffebe3fff [0x7fcf3654d000]
> > > > > > > (memory region gets re-mapped with new address)
> > > > > > > qemu-system-x86_64: vfio_dma_map(0x7fcf38861710, 0xfffffffffebe0000, 0x4000, 0x7fcf3654d000) = -14 (Bad address)
> > > > > > > (iommu barfs because it can only handle 48bit physical addresses)
> > > > > > > 
> > > > > > 
> > > > > > Why are you trying to program BAR addresses for dma in the iommu?
> > > > > 
> > > > > Two reasons, first I can't tell the difference between RAM and MMIO.
> > > 
> > > Why can't you? Generally memory core let you find out easily.
> > 
> > My MemoryListener is setup for &address_space_memory and I then filter
> > out anything that's not memory_region_is_ram().  This still gets
> > through, so how do I easily find out?
> > 
> > > But in this case it's vfio device itself that is sized so for sure you
> > > know it's MMIO.
> > 
> > How so?  I have a MemoryListener as described above and pass everything
> > through to the IOMMU.  I suppose I could look through all the
> > VFIODevices and check if the MemoryRegion matches, but that seems really
> > ugly.
> > 
> > > Maybe you will have same issue if there's another device with a 64 bit
> > > bar though, like ivshmem?
> > 
> > Perhaps, I suspect I'll see anything that registers their BAR
> > MemoryRegion from memory_region_init_ram or memory_region_init_ram_ptr.
> 
> Must be a 64 bit BAR to trigger the issue though.
> 
> > > > > Second, it enables peer-to-peer DMA between devices, which is something
> > > > > that we might be able to take advantage of with GPU passthrough.
> > > > > 
> > > > > > > Prior to this change, there was no re-map with the fffffffffebe0000
> > > > > > > address, presumably because it was beyond the address space of the PCI
> > > > > > > window.  This address is clearly not in a PCI MMIO space, so why are we
> > > > > > > allowing it to be realized in the system address space at this location?
> > > > > > > Thanks,
> > > > > > > 
> > > > > > > Alex
> > > > > > 
> > > > > > Why do you think it is not in PCI MMIO space?
> > > > > > True, CPU can't access this address but other pci devices can.
> > > > > 
> > > > > What happens on real hardware when an address like this is programmed to
> > > > > a device?  The CPU doesn't have the physical bits to access it.  I have
> > > > > serious doubts that another PCI device would be able to access it
> > > > > either.  Maybe in some limited scenario where the devices are on the
> > > > > same conventional PCI bus.  In the typical case, PCI addresses are
> > > > > always limited by some kind of aperture, whether that's explicit in
> > > > > bridge windows or implicit in hardware design (and perhaps made explicit
> > > > > in ACPI).  Even if I wanted to filter these out as noise in vfio, how
> > > > > would I do it in a way that still allows real 64bit MMIO to be
> > > > > programmed.  PCI has this knowledge, I hope.  VFIO doesn't.  Thanks,
> > > > > 
> > > > > Alex
> > > > 
> > > 
> > > AFAIK PCI doesn't have that knowledge as such. PCI spec is explicit that
> > > full 64 bit addresses must be allowed and hardware validation
> > > test suites normally check that it actually does work
> > > if it happens.
> > 
> > Sure, PCI devices themselves, but the chipset typically has defined
> > routing, that's more what I'm referring to.  There are generally only
> > fixed address windows for RAM vs MMIO.
> 
> The physical chipset? Likely - in the presence of IOMMU.
> Without that, devices can talk to each other without going
> through chipset, and bridge spec is very explicit that
> full 64 bit addressing must be supported.
> 
> So as long as we don't emulate an IOMMU,
> guest will normally think it's okay to use any address.
> 
> > > Yes, if there's a bridge somewhere on the path that bridge's
> > > windows would protect you, but pci already does this filtering:
> > > if you see this address in the memory map this means
> > > your virtual device is on root bus.
> > > 
> > > So I think it's the other way around: if VFIO requires specific
> > > address ranges to be assigned to devices, it should give this
> > > info to qemu and qemu can give this to guest.
> > > Then anything outside that range can be ignored by VFIO.
> > 
> > Then we get into deficiencies in the IOMMU API and maybe VFIO.  There's
> > currently no way to find out the address width of the IOMMU.  We've been
> > getting by because it's safely close enough to the CPU address width to
> > not be a concern until we start exposing things at the top of the 64bit
> > address space.  Maybe I can safely ignore anything above
> > TARGET_PHYS_ADDR_SPACE_BITS for now.  Thanks,
> > 
> > Alex
> 
> I think it's not related to target CPU at all - it's a host limitation.
> So just make up your own constant, maybe depending on host architecture.
> Long term add an ioctl to query it.

It's a hardware limitation which I'd imagine has some loose ties to the
physical address bits of the CPU.

> Also, we can add a fwcfg interface to tell bios that it should avoid
> placing BARs above some address.

That doesn't help this case, it's a spurious mapping caused by sizing
the BARs with them enabled.  We may still want such a thing to feed into
building ACPI tables though.

> Since it's a vfio limitation I think it should be a vfio API, along the
> lines of vfio_get_addr_space_bits(void).
> (Is this true btw? legacy assignment doesn't have this problem?)

It's an IOMMU hardware limitation, legacy assignment has the same
problem.  It looks like legacy will abort() in QEMU for the failed
mapping and I'm planning to tighten vfio to also kill the VM for failed
mappings.  In the short term, I think I'll ignore any mappings above
TARGET_PHYS_ADDR_SPACE_BITS, long term vfio already has an IOMMU info
ioctl that we could use to return this information, but we'll need to
figure out how to get it out of the IOMMU driver first.  Thanks,

Alex
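
The short-term workaround described above (ignore listener sections that the host IOMMU cannot map) could look roughly like the sketch below. It is only an illustration: the 48-bit constant and the helper name are assumptions, not part of QEMU or the vfio API.

    /* Sketch only: assume a 48-bit host IOMMU until an ioctl can report the width. */
    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    #define ASSUMED_IOMMU_ADDR_BITS 48   /* placeholder, not a real vfio constant */

    /* Return true if a section lies (even partly) above what the IOMMU can map. */
    static bool section_exceeds_iommu(uint64_t offset_within_address_space,
                                      uint64_t size)
    {
        uint64_t limit = 1ULL << ASSUMED_IOMMU_ADDR_BITS;

        return offset_within_address_space >= limit ||
               size > limit - offset_within_address_space;
    }

    int main(void)
    {
        /* The stray mapping from the trace: 0xfffffffffebe0000, 0x4000 bytes. */
        printf("%d\n", section_exceeds_iommu(0xfffffffffebe0000ULL, 0x4000));
        /* A normal 32-bit MMIO BAR at 0xfebe0000 would still be mapped. */
        printf("%d\n", section_exceeds_iommu(0xfebe0000ULL, 0x4000));
        return 0;
    }
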
Michael S. Tsirkin Jan. 12, 2014, 7:54 a.m. UTC | #9
On Fri, Jan 10, 2014 at 08:31:36AM -0700, Alex Williamson wrote:
> On Fri, 2014-01-10 at 14:55 +0200, Michael S. Tsirkin wrote:
> > On Thu, Jan 09, 2014 at 03:42:22PM -0700, Alex Williamson wrote:
> > > On Thu, 2014-01-09 at 23:56 +0200, Michael S. Tsirkin wrote:
> > > > On Thu, Jan 09, 2014 at 12:03:26PM -0700, Alex Williamson wrote:
> > > > > On Thu, 2014-01-09 at 11:47 -0700, Alex Williamson wrote:
> > > > > > On Thu, 2014-01-09 at 20:00 +0200, Michael S. Tsirkin wrote:
> > > > > > > On Thu, Jan 09, 2014 at 10:24:47AM -0700, Alex Williamson wrote:
> > > > > > > > On Wed, 2013-12-11 at 20:30 +0200, Michael S. Tsirkin wrote:
> > > > > > > > > From: Paolo Bonzini <pbonzini@redhat.com>
> > > > > > > > > 
> > > > > > > > > As an alternative to commit 818f86b (exec: limit system memory
> > > > > > > > > size, 2013-11-04) let's just make all address spaces 64-bit wide.
> > > > > > > > > This eliminates problems with phys_page_find ignoring bits above
> > > > > > > > > TARGET_PHYS_ADDR_SPACE_BITS and address_space_translate_internal
> > > > > > > > > consequently messing up the computations.
> > > > > > > > > 
> > > > > > > > > In Luiz's reported crash, at startup gdb attempts to read from address
> > > > > > > > > 0xffffffffffffffe6 to 0xffffffffffffffff inclusive.  The region it gets
> > > > > > > > > is the newly introduced master abort region, which is as big as the PCI
> > > > > > > > > address space (see pci_bus_init).  Due to a typo that's only 2^63-1,
> > > > > > > > > not 2^64.  But we get it anyway because phys_page_find ignores the upper
> > > > > > > > > bits of the physical address.  In address_space_translate_internal then
> > > > > > > > > 
> > > > > > > > >     diff = int128_sub(section->mr->size, int128_make64(addr));
> > > > > > > > >     *plen = int128_get64(int128_min(diff, int128_make64(*plen)));
> > > > > > > > > 
> > > > > > > > > diff becomes negative, and int128_get64 booms.
> > > > > > > > > 
> > > > > > > > > The size of the PCI address space region should be fixed anyway.
> > > > > > > > > 
> > > > > > > > > Reported-by: Luiz Capitulino <lcapitulino@redhat.com>
> > > > > > > > > Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> > > > > > > > > Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> > > > > > > > > ---
> > > > > > > > >  exec.c | 8 ++------
> > > > > > > > >  1 file changed, 2 insertions(+), 6 deletions(-)
> > > > > > > > > 
> > > > > > > > > diff --git a/exec.c b/exec.c
> > > > > > > > > index 7e5ce93..f907f5f 100644
> > > > > > > > > --- a/exec.c
> > > > > > > > > +++ b/exec.c
> > > > > > > > > @@ -94,7 +94,7 @@ struct PhysPageEntry {
> > > > > > > > >  #define PHYS_MAP_NODE_NIL (((uint32_t)~0) >> 6)
> > > > > > > > >  
> > > > > > > > >  /* Size of the L2 (and L3, etc) page tables.  */
> > > > > > > > > -#define ADDR_SPACE_BITS TARGET_PHYS_ADDR_SPACE_BITS
> > > > > > > > > +#define ADDR_SPACE_BITS 64
> > > > > > > > >  
> > > > > > > > >  #define P_L2_BITS 10
> > > > > > > > >  #define P_L2_SIZE (1 << P_L2_BITS)
> > > > > > > > > @@ -1861,11 +1861,7 @@ static void memory_map_init(void)
> > > > > > > > >  {
> > > > > > > > >      system_memory = g_malloc(sizeof(*system_memory));
> > > > > > > > >  
> > > > > > > > > -    assert(ADDR_SPACE_BITS <= 64);
> > > > > > > > > -
> > > > > > > > > -    memory_region_init(system_memory, NULL, "system",
> > > > > > > > > -                       ADDR_SPACE_BITS == 64 ?
> > > > > > > > > -                       UINT64_MAX : (0x1ULL << ADDR_SPACE_BITS));
> > > > > > > > > +    memory_region_init(system_memory, NULL, "system", UINT64_MAX);
> > > > > > > > >      address_space_init(&address_space_memory, system_memory, "memory");
> > > > > > > > >  
> > > > > > > > >      system_io = g_malloc(sizeof(*system_io));
> > > > > > > > 
> > > > > > > > This seems to have some unexpected consequences around sizing 64bit PCI
> > > > > > > > BARs that I'm not sure how to handle.
> > > > > > > 
> > > > > > > BARs are often disabled during sizing. Maybe you
> > > > > > > don't detect BAR being disabled?
> > > > > > 
> > > > > > See the trace below, the BARs are not disabled.  QEMU pci-core is doing
> > > > > > the sizing and memory region updates for the BARs; vfio is just a
> > > > > > pass-through here.
> > > > > 
> > > > > Sorry, not in the trace below, but yes the sizing seems to be happening
> > > > > while I/O & memory are enabled in the command register.  Thanks,
> > > > > 
> > > > > Alex
> > > > 
> > > > OK then from QEMU POV this BAR value is not special at all.
> > > 
> > > Unfortunately
> > > 
> > > > > > > >  After this patch I get vfio
> > > > > > > > traces like this:
> > > > > > > > 
> > > > > > > > vfio: vfio_pci_read_config(0000:01:10.0, @0x10, len=0x4) febe0004
> > > > > > > > (save lower 32bits of BAR)
> > > > > > > > vfio: vfio_pci_write_config(0000:01:10.0, @0x10, 0xffffffff, len=0x4)
> > > > > > > > (write mask to BAR)
> > > > > > > > vfio: region_del febe0000 - febe3fff
> > > > > > > > (memory region gets unmapped)
> > > > > > > > vfio: vfio_pci_read_config(0000:01:10.0, @0x10, len=0x4) ffffc004
> > > > > > > > (read size mask)
> > > > > > > > vfio: vfio_pci_write_config(0000:01:10.0, @0x10, 0xfebe0004, len=0x4)
> > > > > > > > (restore BAR)
> > > > > > > > vfio: region_add febe0000 - febe3fff [0x7fcf3654d000]
> > > > > > > > (memory region re-mapped)
> > > > > > > > vfio: vfio_pci_read_config(0000:01:10.0, @0x14, len=0x4) 0
> > > > > > > > (save upper 32bits of BAR)
> > > > > > > > vfio: vfio_pci_write_config(0000:01:10.0, @0x14, 0xffffffff, len=0x4)
> > > > > > > > (write mask to BAR)
> > > > > > > > vfio: region_del febe0000 - febe3fff
> > > > > > > > (memory region gets unmapped)
> > > > > > > > vfio: region_add fffffffffebe0000 - fffffffffebe3fff [0x7fcf3654d000]
> > > > > > > > (memory region gets re-mapped with new address)
> > > > > > > > qemu-system-x86_64: vfio_dma_map(0x7fcf38861710, 0xfffffffffebe0000, 0x4000, 0x7fcf3654d000) = -14 (Bad address)
> > > > > > > > (iommu barfs because it can only handle 48bit physical addresses)
> > > > > > > > 
> > > > > > > 
> > > > > > > Why are you trying to program BAR addresses for dma in the iommu?
> > > > > > 
> > > > > > Two reasons, first I can't tell the difference between RAM and MMIO.
> > > > 
> > > > Why can't you? Generally memory core let you find out easily.
> > > 
> > > My MemoryListener is setup for &address_space_memory and I then filter
> > > out anything that's not memory_region_is_ram().  This still gets
> > > through, so how do I easily find out?
> > > 
> > > > But in this case it's vfio device itself that is sized so for sure you
> > > > know it's MMIO.
> > > 
> > > How so?  I have a MemoryListener as described above and pass everything
> > > through to the IOMMU.  I suppose I could look through all the
> > > VFIODevices and check if the MemoryRegion matches, but that seems really
> > > ugly.
> > > 
> > > > Maybe you will have same issue if there's another device with a 64 bit
> > > > bar though, like ivshmem?
> > > 
> > > Perhaps, I suspect I'll see anything that registers their BAR
> > > MemoryRegion from memory_region_init_ram or memory_region_init_ram_ptr.
> > 
> > Must be a 64 bit BAR to trigger the issue though.
> > 
> > > > > > Second, it enables peer-to-peer DMA between devices, which is something
> > > > > > that we might be able to take advantage of with GPU passthrough.
> > > > > > 
> > > > > > > > Prior to this change, there was no re-map with the fffffffffebe0000
> > > > > > > > address, presumably because it was beyond the address space of the PCI
> > > > > > > > window.  This address is clearly not in a PCI MMIO space, so why are we
> > > > > > > > allowing it to be realized in the system address space at this location?
> > > > > > > > Thanks,
> > > > > > > > 
> > > > > > > > Alex
> > > > > > > 
> > > > > > > Why do you think it is not in PCI MMIO space?
> > > > > > > True, CPU can't access this address but other pci devices can.
> > > > > > 
> > > > > > What happens on real hardware when an address like this is programmed to
> > > > > > a device?  The CPU doesn't have the physical bits to access it.  I have
> > > > > > serious doubts that another PCI device would be able to access it
> > > > > > either.  Maybe in some limited scenario where the devices are on the
> > > > > > same conventional PCI bus.  In the typical case, PCI addresses are
> > > > > > always limited by some kind of aperture, whether that's explicit in
> > > > > > bridge windows or implicit in hardware design (and perhaps made explicit
> > > > > > in ACPI).  Even if I wanted to filter these out as noise in vfio, how
> > > > > > would I do it in a way that still allows real 64bit MMIO to be
> > > > > > programmed.  PCI has this knowledge, I hope.  VFIO doesn't.  Thanks,
> > > > > > 
> > > > > > Alex
> > > > > 
> > > > 
> > > > AFAIK PCI doesn't have that knowledge as such. PCI spec is explicit that
> > > > full 64 bit addresses must be allowed and hardware validation
> > > > test suites normally check that it actually does work
> > > > if it happens.
> > > 
> > > Sure, PCI devices themselves, but the chipset typically has defined
> > > routing, that's more what I'm referring to.  There are generally only
> > > fixed address windows for RAM vs MMIO.
> > 
> > The physical chipset? Likely - in the presence of IOMMU.
> > Without that, devices can talk to each other without going
> > through chipset, and bridge spec is very explicit that
> > full 64 bit addressing must be supported.
> > 
> > So as long as we don't emulate an IOMMU,
> > guest will normally think it's okay to use any address.
> > 
> > > > Yes, if there's a bridge somewhere on the path that bridge's
> > > > windows would protect you, but pci already does this filtering:
> > > > if you see this address in the memory map this means
> > > > your virtual device is on root bus.
> > > > 
> > > > So I think it's the other way around: if VFIO requires specific
> > > > address ranges to be assigned to devices, it should give this
> > > > info to qemu and qemu can give this to guest.
> > > > Then anything outside that range can be ignored by VFIO.
> > > 
> > > Then we get into deficiencies in the IOMMU API and maybe VFIO.  There's
> > > currently no way to find out the address width of the IOMMU.  We've been
> > > getting by because it's safely close enough to the CPU address width to
> > > not be a concern until we start exposing things at the top of the 64bit
> > > address space.  Maybe I can safely ignore anything above
> > > TARGET_PHYS_ADDR_SPACE_BITS for now.  Thanks,
> > > 
> > > Alex
> > 
> > I think it's not related to target CPU at all - it's a host limitation.
> > So just make up your own constant, maybe depending on host architecture.
> > Long term add an ioctl to query it.
> 
> It's a hardware limitation which I'd imagine has some loose ties to the
> physical address bits of the CPU.
> 
> > Also, we can add a fwcfg interface to tell bios that it should avoid
> > placing BARs above some address.
> 
> That doesn't help this case, it's a spurious mapping caused by sizing
> the BARs with them enabled.  We may still want such a thing to feed into
> building ACPI tables though.

Well the point is that if you want BIOS to avoid
specific addresses, you need to tell it what to avoid.
But neither BIOS nor ACPI actually cover the range above
2^48 ATM so it's not a high priority.

> > Since it's a vfio limitation I think it should be a vfio API, along the
> > lines of vfio_get_addr_space_bits(void).
> > (Is this true btw? legacy assignment doesn't have this problem?)
> 
> It's an IOMMU hardware limitation, legacy assignment has the same
> problem.  It looks like legacy will abort() in QEMU for the failed
> mapping and I'm planning to tighten vfio to also kill the VM for failed
> mappings.  In the short term, I think I'll ignore any mappings above
> TARGET_PHYS_ADDR_SPACE_BITS,

That seems very wrong. It will still fail on an x86 host if we are
emulating a CPU with full 64 bit addressing. The limitation is on the
host side; there's no real reason to tie it to the target.

> long term vfio already has an IOMMU info
> ioctl that we could use to return this information, but we'll need to
> figure out how to get it out of the IOMMU driver first.
>  Thanks,
> 
> Alex

Short term, just assume 48 bits on x86.

We need to figure out what's the limitation on ppc and arm -
maybe there's none and it can address full 64 bit range.

Cc some people who might know about these platforms.
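
For context, the BAR sizing sequence seen in the trace works roughly as sketched below: all-ones is written to each BAR dword and the result read back while memory decode stays enabled, so after the upper-dword write the device briefly decodes at 0xfffffffffebe0000. The config accessors are made up for illustration; this is not QEMU's pci-core code.

    #include <stdint.h>
    #include <stdio.h>

    /* Illustrative config-space model of a single 16K 64-bit memory BAR. */
    static uint32_t bar[2] = { 0xfebe0004, 0x00000000 };  /* low dword carries flag bits */
    static const uint64_t bar_size = 0x4000;

    static uint32_t cfg_read(int i)  { return bar[i]; }
    static void     cfg_write(int i, uint32_t v)
    {
        uint64_t mask = ~(bar_size - 1);
        uint64_t full = ((uint64_t)(i ? v : bar[1]) << 32) | (i ? bar[0] : v);

        full &= mask;                    /* device only implements size-aligned bits */
        bar[0] = (uint32_t)full | 0x4;   /* keep the 64-bit memory BAR type bits */
        bar[1] = (uint32_t)(full >> 32);
    }

    int main(void)
    {
        uint32_t save;

        save = cfg_read(0);              /* save lower 32 bits of BAR */
        cfg_write(0, 0xffffffff);        /* write mask to BAR */
        printf("low size mask %08x\n", cfg_read(0));   /* ffffc004, as in the trace */
        cfg_write(0, save);              /* restore BAR */

        save = cfg_read(1);              /* save upper 32 bits of BAR */
        cfg_write(1, 0xffffffff);        /* write mask: BAR now decodes at fffffffffebe0000 */
        printf("transient addr %08x%08x\n", cfg_read(1), cfg_read(0) & ~0xfu);
        cfg_write(1, save);              /* restore BAR */
        return 0;
    }
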
Alexander Graf Jan. 12, 2014, 3:03 p.m. UTC | #10
On 12.01.2014, at 08:54, Michael S. Tsirkin <mst@redhat.com> wrote:

> On Fri, Jan 10, 2014 at 08:31:36AM -0700, Alex Williamson wrote:
>> On Fri, 2014-01-10 at 14:55 +0200, Michael S. Tsirkin wrote:
>>> On Thu, Jan 09, 2014 at 03:42:22PM -0700, Alex Williamson wrote:
>>>> On Thu, 2014-01-09 at 23:56 +0200, Michael S. Tsirkin wrote:
>>>>> On Thu, Jan 09, 2014 at 12:03:26PM -0700, Alex Williamson wrote:
>>>>>> On Thu, 2014-01-09 at 11:47 -0700, Alex Williamson wrote:
>>>>>>> On Thu, 2014-01-09 at 20:00 +0200, Michael S. Tsirkin wrote:
>>>>>>>> On Thu, Jan 09, 2014 at 10:24:47AM -0700, Alex Williamson wrote:
>>>>>>>>> On Wed, 2013-12-11 at 20:30 +0200, Michael S. Tsirkin wrote:
>>>>>>>>>> From: Paolo Bonzini <pbonzini@redhat.com>
>>>>>>>>>> 
>>>>>>>>>> As an alternative to commit 818f86b (exec: limit system memory
>>>>>>>>>> size, 2013-11-04) let's just make all address spaces 64-bit wide.
>>>>>>>>>> This eliminates problems with phys_page_find ignoring bits above
>>>>>>>>>> TARGET_PHYS_ADDR_SPACE_BITS and address_space_translate_internal
>>>>>>>>>> consequently messing up the computations.
>>>>>>>>>> 
>>>>>>>>>> In Luiz's reported crash, at startup gdb attempts to read from address
>>>>>>>>>> 0xffffffffffffffe6 to 0xffffffffffffffff inclusive.  The region it gets
>>>>>>>>>> is the newly introduced master abort region, which is as big as the PCI
>>>>>>>>>> address space (see pci_bus_init).  Due to a typo that's only 2^63-1,
>>>>>>>>>> not 2^64.  But we get it anyway because phys_page_find ignores the upper
>>>>>>>>>> bits of the physical address.  In address_space_translate_internal then
>>>>>>>>>> 
>>>>>>>>>>    diff = int128_sub(section->mr->size, int128_make64(addr));
>>>>>>>>>>    *plen = int128_get64(int128_min(diff, int128_make64(*plen)));
>>>>>>>>>> 
>>>>>>>>>> diff becomes negative, and int128_get64 booms.
>>>>>>>>>> 
>>>>>>>>>> The size of the PCI address space region should be fixed anyway.
>>>>>>>>>> 
>>>>>>>>>> Reported-by: Luiz Capitulino <lcapitulino@redhat.com>
>>>>>>>>>> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
>>>>>>>>>> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
>>>>>>>>>> ---
>>>>>>>>>> exec.c | 8 ++------
>>>>>>>>>> 1 file changed, 2 insertions(+), 6 deletions(-)
>>>>>>>>>> 
>>>>>>>>>> diff --git a/exec.c b/exec.c
>>>>>>>>>> index 7e5ce93..f907f5f 100644
>>>>>>>>>> --- a/exec.c
>>>>>>>>>> +++ b/exec.c
>>>>>>>>>> @@ -94,7 +94,7 @@ struct PhysPageEntry {
>>>>>>>>>> #define PHYS_MAP_NODE_NIL (((uint32_t)~0) >> 6)
>>>>>>>>>> 
>>>>>>>>>> /* Size of the L2 (and L3, etc) page tables.  */
>>>>>>>>>> -#define ADDR_SPACE_BITS TARGET_PHYS_ADDR_SPACE_BITS
>>>>>>>>>> +#define ADDR_SPACE_BITS 64
>>>>>>>>>> 
>>>>>>>>>> #define P_L2_BITS 10
>>>>>>>>>> #define P_L2_SIZE (1 << P_L2_BITS)
>>>>>>>>>> @@ -1861,11 +1861,7 @@ static void memory_map_init(void)
>>>>>>>>>> {
>>>>>>>>>>     system_memory = g_malloc(sizeof(*system_memory));
>>>>>>>>>> 
>>>>>>>>>> -    assert(ADDR_SPACE_BITS <= 64);
>>>>>>>>>> -
>>>>>>>>>> -    memory_region_init(system_memory, NULL, "system",
>>>>>>>>>> -                       ADDR_SPACE_BITS == 64 ?
>>>>>>>>>> -                       UINT64_MAX : (0x1ULL << ADDR_SPACE_BITS));
>>>>>>>>>> +    memory_region_init(system_memory, NULL, "system", UINT64_MAX);
>>>>>>>>>>     address_space_init(&address_space_memory, system_memory, "memory");
>>>>>>>>>> 
>>>>>>>>>>     system_io = g_malloc(sizeof(*system_io));
>>>>>>>>> 
>>>>>>>>> This seems to have some unexpected consequences around sizing 64bit PCI
>>>>>>>>> BARs that I'm not sure how to handle.
>>>>>>>> 
>>>>>>>> BARs are often disabled during sizing. Maybe you
>>>>>>>> don't detect BAR being disabled?
>>>>>>> 
>>>>>>> See the trace below, the BARs are not disabled.  QEMU pci-core is doing
>>>>>>> the sizing and memory region updates for the BARs; vfio is just a
>>>>>>> pass-through here.
>>>>>> 
>>>>>> Sorry, not in the trace below, but yes the sizing seems to be happening
>>>>>> while I/O & memory are enabled in the command register.  Thanks,
>>>>>> 
>>>>>> Alex
>>>>> 
>>>>> OK then from QEMU POV this BAR value is not special at all.
>>>> 
>>>> Unfortunately
>>>> 
>>>>>>>>> After this patch I get vfio
>>>>>>>>> traces like this:
>>>>>>>>> 
>>>>>>>>> vfio: vfio_pci_read_config(0000:01:10.0, @0x10, len=0x4) febe0004
>>>>>>>>> (save lower 32bits of BAR)
>>>>>>>>> vfio: vfio_pci_write_config(0000:01:10.0, @0x10, 0xffffffff, len=0x4)
>>>>>>>>> (write mask to BAR)
>>>>>>>>> vfio: region_del febe0000 - febe3fff
>>>>>>>>> (memory region gets unmapped)
>>>>>>>>> vfio: vfio_pci_read_config(0000:01:10.0, @0x10, len=0x4) ffffc004
>>>>>>>>> (read size mask)
>>>>>>>>> vfio: vfio_pci_write_config(0000:01:10.0, @0x10, 0xfebe0004, len=0x4)
>>>>>>>>> (restore BAR)
>>>>>>>>> vfio: region_add febe0000 - febe3fff [0x7fcf3654d000]
>>>>>>>>> (memory region re-mapped)
>>>>>>>>> vfio: vfio_pci_read_config(0000:01:10.0, @0x14, len=0x4) 0
>>>>>>>>> (save upper 32bits of BAR)
>>>>>>>>> vfio: vfio_pci_write_config(0000:01:10.0, @0x14, 0xffffffff, len=0x4)
>>>>>>>>> (write mask to BAR)
>>>>>>>>> vfio: region_del febe0000 - febe3fff
>>>>>>>>> (memory region gets unmapped)
>>>>>>>>> vfio: region_add fffffffffebe0000 - fffffffffebe3fff [0x7fcf3654d000]
>>>>>>>>> (memory region gets re-mapped with new address)
>>>>>>>>> qemu-system-x86_64: vfio_dma_map(0x7fcf38861710, 0xfffffffffebe0000, 0x4000, 0x7fcf3654d000) = -14 (Bad address)
>>>>>>>>> (iommu barfs because it can only handle 48bit physical addresses)
>>>>>>>>> 
>>>>>>>> 
>>>>>>>> Why are you trying to program BAR addresses for dma in the iommu?
>>>>>>> 
>>>>>>> Two reasons, first I can't tell the difference between RAM and MMIO.
>>>>> 
>>>>> Why can't you? Generally memory core let you find out easily.
>>>> 
>>>> My MemoryListener is setup for &address_space_memory and I then filter
>>>> out anything that's not memory_region_is_ram().  This still gets
>>>> through, so how do I easily find out?
>>>> 
>>>>> But in this case it's vfio device itself that is sized so for sure you
>>>>> know it's MMIO.
>>>> 
>>>> How so?  I have a MemoryListener as described above and pass everything
>>>> through to the IOMMU.  I suppose I could look through all the
>>>> VFIODevices and check if the MemoryRegion matches, but that seems really
>>>> ugly.
>>>> 
>>>>> Maybe you will have same issue if there's another device with a 64 bit
>>>>> bar though, like ivshmem?
>>>> 
>>>> Perhaps, I suspect I'll see anything that registers their BAR
>>>> MemoryRegion from memory_region_init_ram or memory_region_init_ram_ptr.
>>> 
>>> Must be a 64 bit BAR to trigger the issue though.
>>> 
>>>>>>> Second, it enables peer-to-peer DMA between devices, which is something
>>>>>>> that we might be able to take advantage of with GPU passthrough.
>>>>>>> 
>>>>>>>>> Prior to this change, there was no re-map with the fffffffffebe0000
>>>>>>>>> address, presumably because it was beyond the address space of the PCI
>>>>>>>>> window.  This address is clearly not in a PCI MMIO space, so why are we
>>>>>>>>> allowing it to be realized in the system address space at this location?
>>>>>>>>> Thanks,
>>>>>>>>> 
>>>>>>>>> Alex
>>>>>>>> 
>>>>>>>> Why do you think it is not in PCI MMIO space?
>>>>>>>> True, CPU can't access this address but other pci devices can.
>>>>>>> 
>>>>>>> What happens on real hardware when an address like this is programmed to
>>>>>>> a device?  The CPU doesn't have the physical bits to access it.  I have
>>>>>>> serious doubts that another PCI device would be able to access it
>>>>>>> either.  Maybe in some limited scenario where the devices are on the
>>>>>>> same conventional PCI bus.  In the typical case, PCI addresses are
>>>>>>> always limited by some kind of aperture, whether that's explicit in
>>>>>>> bridge windows or implicit in hardware design (and perhaps made explicit
>>>>>>> in ACPI).  Even if I wanted to filter these out as noise in vfio, how
>>>>>>> would I do it in a way that still allows real 64bit MMIO to be
>>>>>>> programmed.  PCI has this knowledge, I hope.  VFIO doesn't.  Thanks,
>>>>>>> 
>>>>>>> Alex
>>>>>> 
>>>>> 
>>>>> AFAIK PCI doesn't have that knowledge as such. PCI spec is explicit that
>>>>> full 64 bit addresses must be allowed and hardware validation
>>>>> test suites normally check that it actually does work
>>>>> if it happens.
>>>> 
>>>> Sure, PCI devices themselves, but the chipset typically has defined
>>>> routing, that's more what I'm referring to.  There are generally only
>>>> fixed address windows for RAM vs MMIO.
>>> 
>>> The physical chipset? Likely - in the presence of IOMMU.
>>> Without that, devices can talk to each other without going
>>> through chipset, and bridge spec is very explicit that
>>> full 64 bit addressing must be supported.
>>> 
>>> So as long as we don't emulate an IOMMU,
>>> guest will normally think it's okay to use any address.
>>> 
>>>>> Yes, if there's a bridge somewhere on the path that bridge's
>>>>> windows would protect you, but pci already does this filtering:
>>>>> if you see this address in the memory map this means
>>>>> your virtual device is on root bus.
>>>>> 
>>>>> So I think it's the other way around: if VFIO requires specific
>>>>> address ranges to be assigned to devices, it should give this
>>>>> info to qemu and qemu can give this to guest.
>>>>> Then anything outside that range can be ignored by VFIO.
>>>> 
>>>> Then we get into deficiencies in the IOMMU API and maybe VFIO.  There's
>>>> currently no way to find out the address width of the IOMMU.  We've been
>>>> getting by because it's safely close enough to the CPU address width to
>>>> not be a concern until we start exposing things at the top of the 64bit
>>>> address space.  Maybe I can safely ignore anything above
>>>> TARGET_PHYS_ADDR_SPACE_BITS for now.  Thanks,
>>>> 
>>>> Alex
>>> 
>>> I think it's not related to target CPU at all - it's a host limitation.
>>> So just make up your own constant, maybe depending on host architecture.
>>> Long term add an ioctl to query it.
>> 
>> It's a hardware limitation which I'd imagine has some loose ties to the
>> physical address bits of the CPU.
>> 
>>> Also, we can add a fwcfg interface to tell bios that it should avoid
>>> placing BARs above some address.
>> 
>> That doesn't help this case, it's a spurious mapping caused by sizing
>> the BARs with them enabled.  We may still want such a thing to feed into
>> building ACPI tables though.
> 
> Well the point is that if you want BIOS to avoid
> specific addresses, you need to tell it what to avoid.
> But neither BIOS nor ACPI actually cover the range above
> 2^48 ATM so it's not a high priority.
> 
>>> Since it's a vfio limitation I think it should be a vfio API, along the
>>> lines of vfio_get_addr_space_bits(void).
>>> (Is this true btw? legacy assignment doesn't have this problem?)
>> 
>> It's an IOMMU hardware limitation, legacy assignment has the same
>> problem.  It looks like legacy will abort() in QEMU for the failed
>> mapping and I'm planning to tighten vfio to also kill the VM for failed
>> mappings.  In the short term, I think I'll ignore any mappings above
>> TARGET_PHYS_ADDR_SPACE_BITS,
> 
> That seems very wrong. It will still fail on an x86 host if we are
> emulating a CPU with full 64 bit addressing. The limitation is on the
> host side; there's no real reason to tie it to the target.
> 
>> long term vfio already has an IOMMU info
>> ioctl that we could use to return this information, but we'll need to
>> figure out how to get it out of the IOMMU driver first.
>> Thanks,
>> 
>> Alex
> 
> Short term, just assume 48 bits on x86.
> 
> We need to figure out what's the limitation on ppc and arm -
> maybe there's none and it can address full 64 bit range.

IIUC on PPC and ARM you always have BAR windows where things can get mapped into, unlike x86 where the full physical address range can be overlaid by BARs.

Or did I misunderstand the question?


Alex
Alex Williamson Jan. 13, 2014, 9:39 p.m. UTC | #11
On Sun, 2014-01-12 at 16:03 +0100, Alexander Graf wrote:
> On 12.01.2014, at 08:54, Michael S. Tsirkin <mst@redhat.com> wrote:
> 
> > On Fri, Jan 10, 2014 at 08:31:36AM -0700, Alex Williamson wrote:
> >> On Fri, 2014-01-10 at 14:55 +0200, Michael S. Tsirkin wrote:
> >>> On Thu, Jan 09, 2014 at 03:42:22PM -0700, Alex Williamson wrote:
> >>>> On Thu, 2014-01-09 at 23:56 +0200, Michael S. Tsirkin wrote:
> >>>>> On Thu, Jan 09, 2014 at 12:03:26PM -0700, Alex Williamson wrote:
> >>>>>> On Thu, 2014-01-09 at 11:47 -0700, Alex Williamson wrote:
> >>>>>>> On Thu, 2014-01-09 at 20:00 +0200, Michael S. Tsirkin wrote:
> >>>>>>>> On Thu, Jan 09, 2014 at 10:24:47AM -0700, Alex Williamson wrote:
> >>>>>>>>> On Wed, 2013-12-11 at 20:30 +0200, Michael S. Tsirkin wrote:
> >>>>>>>>>> From: Paolo Bonzini <pbonzini@redhat.com>
> >>>>>>>>>> 
> >>>>>>>>>> As an alternative to commit 818f86b (exec: limit system memory
> >>>>>>>>>> size, 2013-11-04) let's just make all address spaces 64-bit wide.
> >>>>>>>>>> This eliminates problems with phys_page_find ignoring bits above
> >>>>>>>>>> TARGET_PHYS_ADDR_SPACE_BITS and address_space_translate_internal
> >>>>>>>>>> consequently messing up the computations.
> >>>>>>>>>> 
> >>>>>>>>>> In Luiz's reported crash, at startup gdb attempts to read from address
> >>>>>>>>>> 0xffffffffffffffe6 to 0xffffffffffffffff inclusive.  The region it gets
> >>>>>>>>>> is the newly introduced master abort region, which is as big as the PCI
> >>>>>>>>>> address space (see pci_bus_init).  Due to a typo that's only 2^63-1,
> >>>>>>>>>> not 2^64.  But we get it anyway because phys_page_find ignores the upper
> >>>>>>>>>> bits of the physical address.  In address_space_translate_internal then
> >>>>>>>>>> 
> >>>>>>>>>>    diff = int128_sub(section->mr->size, int128_make64(addr));
> >>>>>>>>>>    *plen = int128_get64(int128_min(diff, int128_make64(*plen)));
> >>>>>>>>>> 
> >>>>>>>>>> diff becomes negative, and int128_get64 booms.
> >>>>>>>>>> 
> >>>>>>>>>> The size of the PCI address space region should be fixed anyway.
> >>>>>>>>>> 
> >>>>>>>>>> Reported-by: Luiz Capitulino <lcapitulino@redhat.com>
> >>>>>>>>>> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> >>>>>>>>>> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> >>>>>>>>>> ---
> >>>>>>>>>> exec.c | 8 ++------
> >>>>>>>>>> 1 file changed, 2 insertions(+), 6 deletions(-)
> >>>>>>>>>> 
> >>>>>>>>>> diff --git a/exec.c b/exec.c
> >>>>>>>>>> index 7e5ce93..f907f5f 100644
> >>>>>>>>>> --- a/exec.c
> >>>>>>>>>> +++ b/exec.c
> >>>>>>>>>> @@ -94,7 +94,7 @@ struct PhysPageEntry {
> >>>>>>>>>> #define PHYS_MAP_NODE_NIL (((uint32_t)~0) >> 6)
> >>>>>>>>>> 
> >>>>>>>>>> /* Size of the L2 (and L3, etc) page tables.  */
> >>>>>>>>>> -#define ADDR_SPACE_BITS TARGET_PHYS_ADDR_SPACE_BITS
> >>>>>>>>>> +#define ADDR_SPACE_BITS 64
> >>>>>>>>>> 
> >>>>>>>>>> #define P_L2_BITS 10
> >>>>>>>>>> #define P_L2_SIZE (1 << P_L2_BITS)
> >>>>>>>>>> @@ -1861,11 +1861,7 @@ static void memory_map_init(void)
> >>>>>>>>>> {
> >>>>>>>>>>     system_memory = g_malloc(sizeof(*system_memory));
> >>>>>>>>>> 
> >>>>>>>>>> -    assert(ADDR_SPACE_BITS <= 64);
> >>>>>>>>>> -
> >>>>>>>>>> -    memory_region_init(system_memory, NULL, "system",
> >>>>>>>>>> -                       ADDR_SPACE_BITS == 64 ?
> >>>>>>>>>> -                       UINT64_MAX : (0x1ULL << ADDR_SPACE_BITS));
> >>>>>>>>>> +    memory_region_init(system_memory, NULL, "system", UINT64_MAX);
> >>>>>>>>>>     address_space_init(&address_space_memory, system_memory, "memory");
> >>>>>>>>>> 
> >>>>>>>>>>     system_io = g_malloc(sizeof(*system_io));
> >>>>>>>>> 
> >>>>>>>>> This seems to have some unexpected consequences around sizing 64bit PCI
> >>>>>>>>> BARs that I'm not sure how to handle.
> >>>>>>>> 
> >>>>>>>> BARs are often disabled during sizing. Maybe you
> >>>>>>>> don't detect BAR being disabled?
> >>>>>>> 
> >>>>>>> See the trace below, the BARs are not disabled.  QEMU pci-core is doing
> >>>>>>> the sizing and memory region updates for the BARs; vfio is just a
> >>>>>>> pass-through here.
> >>>>>> 
> >>>>>> Sorry, not in the trace below, but yes the sizing seems to be happening
> >>>>>> while I/O & memory are enabled in the command register.  Thanks,
> >>>>>> 
> >>>>>> Alex
> >>>>> 
> >>>>> OK then from QEMU POV this BAR value is not special at all.
> >>>> 
> >>>> Unfortunately
> >>>> 
> >>>>>>>>> After this patch I get vfio
> >>>>>>>>> traces like this:
> >>>>>>>>> 
> >>>>>>>>> vfio: vfio_pci_read_config(0000:01:10.0, @0x10, len=0x4) febe0004
> >>>>>>>>> (save lower 32bits of BAR)
> >>>>>>>>> vfio: vfio_pci_write_config(0000:01:10.0, @0x10, 0xffffffff, len=0x4)
> >>>>>>>>> (write mask to BAR)
> >>>>>>>>> vfio: region_del febe0000 - febe3fff
> >>>>>>>>> (memory region gets unmapped)
> >>>>>>>>> vfio: vfio_pci_read_config(0000:01:10.0, @0x10, len=0x4) ffffc004
> >>>>>>>>> (read size mask)
> >>>>>>>>> vfio: vfio_pci_write_config(0000:01:10.0, @0x10, 0xfebe0004, len=0x4)
> >>>>>>>>> (restore BAR)
> >>>>>>>>> vfio: region_add febe0000 - febe3fff [0x7fcf3654d000]
> >>>>>>>>> (memory region re-mapped)
> >>>>>>>>> vfio: vfio_pci_read_config(0000:01:10.0, @0x14, len=0x4) 0
> >>>>>>>>> (save upper 32bits of BAR)
> >>>>>>>>> vfio: vfio_pci_write_config(0000:01:10.0, @0x14, 0xffffffff, len=0x4)
> >>>>>>>>> (write mask to BAR)
> >>>>>>>>> vfio: region_del febe0000 - febe3fff
> >>>>>>>>> (memory region gets unmapped)
> >>>>>>>>> vfio: region_add fffffffffebe0000 - fffffffffebe3fff [0x7fcf3654d000]
> >>>>>>>>> (memory region gets re-mapped with new address)
> >>>>>>>>> qemu-system-x86_64: vfio_dma_map(0x7fcf38861710, 0xfffffffffebe0000, 0x4000, 0x7fcf3654d000) = -14 (Bad address)
> >>>>>>>>> (iommu barfs because it can only handle 48bit physical addresses)
> >>>>>>>>> 
> >>>>>>>> 
> >>>>>>>> Why are you trying to program BAR addresses for dma in the iommu?
> >>>>>>> 
> >>>>>>> Two reasons, first I can't tell the difference between RAM and MMIO.
> >>>>> 
> >>>>> Why can't you? Generally memory core let you find out easily.
> >>>> 
> >>>> My MemoryListener is setup for &address_space_memory and I then filter
> >>>> out anything that's not memory_region_is_ram().  This still gets
> >>>> through, so how do I easily find out?
> >>>> 
> >>>>> But in this case it's vfio device itself that is sized so for sure you
> >>>>> know it's MMIO.
> >>>> 
> >>>> How so?  I have a MemoryListener as described above and pass everything
> >>>> through to the IOMMU.  I suppose I could look through all the
> >>>> VFIODevices and check if the MemoryRegion matches, but that seems really
> >>>> ugly.
> >>>> 
> >>>>> Maybe you will have same issue if there's another device with a 64 bit
> >>>>> bar though, like ivshmem?
> >>>> 
> >>>> Perhaps, I suspect I'll see anything that registers their BAR
> >>>> MemoryRegion from memory_region_init_ram or memory_region_init_ram_ptr.
> >>> 
> >>> Must be a 64 bit BAR to trigger the issue though.
> >>> 
> >>>>>>> Second, it enables peer-to-peer DMA between devices, which is something
> >>>>>>> that we might be able to take advantage of with GPU passthrough.
> >>>>>>> 
> >>>>>>>>> Prior to this change, there was no re-map with the fffffffffebe0000
> >>>>>>>>> address, presumably because it was beyond the address space of the PCI
> >>>>>>>>> window.  This address is clearly not in a PCI MMIO space, so why are we
> >>>>>>>>> allowing it to be realized in the system address space at this location?
> >>>>>>>>> Thanks,
> >>>>>>>>> 
> >>>>>>>>> Alex
> >>>>>>>> 
> >>>>>>>> Why do you think it is not in PCI MMIO space?
> >>>>>>>> True, CPU can't access this address but other pci devices can.
> >>>>>>> 
> >>>>>>> What happens on real hardware when an address like this is programmed to
> >>>>>>> a device?  The CPU doesn't have the physical bits to access it.  I have
> >>>>>>> serious doubts that another PCI device would be able to access it
> >>>>>>> either.  Maybe in some limited scenario where the devices are on the
> >>>>>>> same conventional PCI bus.  In the typical case, PCI addresses are
> >>>>>>> always limited by some kind of aperture, whether that's explicit in
> >>>>>>> bridge windows or implicit in hardware design (and perhaps made explicit
> >>>>>>> in ACPI).  Even if I wanted to filter these out as noise in vfio, how
> >>>>>>> would I do it in a way that still allows real 64bit MMIO to be
> >>>>>>> programmed.  PCI has this knowledge, I hope.  VFIO doesn't.  Thanks,
> >>>>>>> 
> >>>>>>> Alex
> >>>>>> 
> >>>>> 
> >>>>> AFAIK PCI doesn't have that knowledge as such. PCI spec is explicit that
> >>>>> full 64 bit addresses must be allowed and hardware validation
> >>>>> test suites normally check that it actually does work
> >>>>> if it happens.
> >>>> 
> >>>> Sure, PCI devices themselves, but the chipset typically has defined
> >>>> routing, that's more what I'm referring to.  There are generally only
> >>>> fixed address windows for RAM vs MMIO.
> >>> 
> >>> The physical chipset? Likely - in the presence of IOMMU.
> >>> Without that, devices can talk to each other without going
> >>> through chipset, and bridge spec is very explicit that
> >>> full 64 bit addressing must be supported.
> >>> 
> >>> So as long as we don't emulate an IOMMU,
> >>> guest will normally think it's okay to use any address.
> >>> 
> >>>>> Yes, if there's a bridge somewhere on the path that bridge's
> >>>>> windows would protect you, but pci already does this filtering:
> >>>>> if you see this address in the memory map this means
> >>>>> your virtual device is on root bus.
> >>>>> 
> >>>>> So I think it's the other way around: if VFIO requires specific
> >>>>> address ranges to be assigned to devices, it should give this
> >>>>> info to qemu and qemu can give this to guest.
> >>>>> Then anything outside that range can be ignored by VFIO.
> >>>> 
> >>>> Then we get into deficiencies in the IOMMU API and maybe VFIO.  There's
> >>>> currently no way to find out the address width of the IOMMU.  We've been
> >>>> getting by because it's safely close enough to the CPU address width to
> >>>> not be a concern until we start exposing things at the top of the 64bit
> >>>> address space.  Maybe I can safely ignore anything above
> >>>> TARGET_PHYS_ADDR_SPACE_BITS for now.  Thanks,
> >>>> 
> >>>> Alex
> >>> 
> >>> I think it's not related to target CPU at all - it's a host limitation.
> >>> So just make up your own constant, maybe depending on host architecture.
> >>> Long term add an ioctl to query it.
> >> 
> >> It's a hardware limitation which I'd imagine has some loose ties to the
> >> physical address bits of the CPU.
> >> 
> >>> Also, we can add a fwcfg interface to tell bios that it should avoid
> >>> placing BARs above some address.
> >> 
> >> That doesn't help this case, it's a spurious mapping caused by sizing
> >> the BARs with them enabled.  We may still want such a thing to feed into
> >> building ACPI tables though.
> > 
> > Well the point is that if you want BIOS to avoid
> > specific addresses, you need to tell it what to avoid.
> > But neither BIOS nor ACPI actually cover the range above
> > 2^48 ATM so it's not a high priority.
> > 
> >>> Since it's a vfio limitation I think it should be a vfio API, along the
> >>> lines of vfio_get_addr_space_bits(void).
> >>> (Is this true btw? legacy assignment doesn't have this problem?)
> >> 
> >> It's an IOMMU hardware limitation, legacy assignment has the same
> >> problem.  It looks like legacy will abort() in QEMU for the failed
> >> mapping and I'm planning to tighten vfio to also kill the VM for failed
> >> mappings.  In the short term, I think I'll ignore any mappings above
> >> TARGET_PHYS_ADDR_SPACE_BITS,
> > 
> > That seems very wrong. It will still fail on an x86 host if we are
> > emulating a CPU with full 64 bit addressing. The limitation is on the
> > host side; there's no real reason to tie it to the target.

I doubt vfio would be the only thing broken in that case.

> >> long term vfio already has an IOMMU info
> >> ioctl that we could use to return this information, but we'll need to
> >> figure out how to get it out of the IOMMU driver first.
> >> Thanks,
> >> 
> >> Alex
> > 
> > Short term, just assume 48 bits on x86.

I hate to pick an arbitrary value since we have a very specific mapping
we're trying to avoid.  Perhaps a better option is to skip anything
where:

        MemoryRegionSection.offset_within_address_space >
        ~MemoryRegionSection.offset_within_address_space

> > We need to figure out what's the limitation on ppc and arm -
> > maybe there's none and it can address full 64 bit range.
> 
> IIUC on PPC and ARM you always have BAR windows where things can get mapped into, unlike x86 where the full physical address range can be overlaid by BARs.
> 
> Or did I misunderstand the question?

Sounds right, if either BAR mappings outside the window will not be
realized in the memory space or the IOMMU has a full 64bit address
space, there's no problem.  Here we have an intermediate step in the BAR
sizing producing a stray mapping that the IOMMU hardware can't handle.
Even if we could handle it, it's not clear that we want to.  On AMD-Vi
the IOMMU page tables can grow to 6 levels deep.  A stray mapping like
this then causes space and time overhead until the tables are pruned
back down.  Thanks,

Alex
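
The complement test proposed above can be written as a tiny helper; the function name is hypothetical, only the comparison itself comes from the suggestion.

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    /*
     * Sketch of the proposed filter: skip any section whose start address is
     * greater than its bitwise complement, i.e. anything at or above 2^63.
     * The stray BAR-sizing mapping at 0xfffffffffebe0000 trips this, while
     * ordinary RAM and 64-bit MMIO placed below 2^63 does not.
     */
    static bool skip_section(uint64_t offset_within_address_space)
    {
        return offset_within_address_space > ~offset_within_address_space;
    }

    int main(void)
    {
        printf("%d\n", skip_section(0xfffffffffebe0000ULL));  /* 1: skipped */
        printf("%d\n", skip_section(0x00000000febe0000ULL));  /* 0: mapped  */
        return 0;
    }
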
Alexander Graf Jan. 13, 2014, 9:48 p.m. UTC | #12
> Am 13.01.2014 um 22:39 schrieb Alex Williamson <alex.williamson@redhat.com>:
> 
>> On Sun, 2014-01-12 at 16:03 +0100, Alexander Graf wrote:
>>> On 12.01.2014, at 08:54, Michael S. Tsirkin <mst@redhat.com> wrote:
>>> 
>>>> On Fri, Jan 10, 2014 at 08:31:36AM -0700, Alex Williamson wrote:
>>>>> On Fri, 2014-01-10 at 14:55 +0200, Michael S. Tsirkin wrote:
>>>>>> On Thu, Jan 09, 2014 at 03:42:22PM -0700, Alex Williamson wrote:
>>>>>>> On Thu, 2014-01-09 at 23:56 +0200, Michael S. Tsirkin wrote:
>>>>>>>> On Thu, Jan 09, 2014 at 12:03:26PM -0700, Alex Williamson wrote:
>>>>>>>>> On Thu, 2014-01-09 at 11:47 -0700, Alex Williamson wrote:
>>>>>>>>>> On Thu, 2014-01-09 at 20:00 +0200, Michael S. Tsirkin wrote:
>>>>>>>>>>> On Thu, Jan 09, 2014 at 10:24:47AM -0700, Alex Williamson wrote:
>>>>>>>>>>>> On Wed, 2013-12-11 at 20:30 +0200, Michael S. Tsirkin wrote:
>>>>>>>>>>>> From: Paolo Bonzini <pbonzini@redhat.com>
>>>>>>>>>>>> 
>>>>>>>>>>>> As an alternative to commit 818f86b (exec: limit system memory
>>>>>>>>>>>> size, 2013-11-04) let's just make all address spaces 64-bit wide.
>>>>>>>>>>>> This eliminates problems with phys_page_find ignoring bits above
>>>>>>>>>>>> TARGET_PHYS_ADDR_SPACE_BITS and address_space_translate_internal
>>>>>>>>>>>> consequently messing up the computations.
>>>>>>>>>>>> 
>>>>>>>>>>>> In Luiz's reported crash, at startup gdb attempts to read from address
>>>>>>>>>>>> 0xffffffffffffffe6 to 0xffffffffffffffff inclusive.  The region it gets
>>>>>>>>>>>> is the newly introduced master abort region, which is as big as the PCI
>>>>>>>>>>>> address space (see pci_bus_init).  Due to a typo that's only 2^63-1,
>>>>>>>>>>>> not 2^64.  But we get it anyway because phys_page_find ignores the upper
>>>>>>>>>>>> bits of the physical address.  In address_space_translate_internal then
>>>>>>>>>>>> 
>>>>>>>>>>>>   diff = int128_sub(section->mr->size, int128_make64(addr));
>>>>>>>>>>>>   *plen = int128_get64(int128_min(diff, int128_make64(*plen)));
>>>>>>>>>>>> 
>>>>>>>>>>>> diff becomes negative, and int128_get64 booms.
>>>>>>>>>>>> 
>>>>>>>>>>>> The size of the PCI address space region should be fixed anyway.
>>>>>>>>>>>> 
>>>>>>>>>>>> Reported-by: Luiz Capitulino <lcapitulino@redhat.com>
>>>>>>>>>>>> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
>>>>>>>>>>>> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
>>>>>>>>>>>> ---
>>>>>>>>>>>> exec.c | 8 ++------
>>>>>>>>>>>> 1 file changed, 2 insertions(+), 6 deletions(-)
>>>>>>>>>>>> 
>>>>>>>>>>>> diff --git a/exec.c b/exec.c
>>>>>>>>>>>> index 7e5ce93..f907f5f 100644
>>>>>>>>>>>> --- a/exec.c
>>>>>>>>>>>> +++ b/exec.c
>>>>>>>>>>>> @@ -94,7 +94,7 @@ struct PhysPageEntry {
>>>>>>>>>>>> #define PHYS_MAP_NODE_NIL (((uint32_t)~0) >> 6)
>>>>>>>>>>>> 
>>>>>>>>>>>> /* Size of the L2 (and L3, etc) page tables.  */
>>>>>>>>>>>> -#define ADDR_SPACE_BITS TARGET_PHYS_ADDR_SPACE_BITS
>>>>>>>>>>>> +#define ADDR_SPACE_BITS 64
>>>>>>>>>>>> 
>>>>>>>>>>>> #define P_L2_BITS 10
>>>>>>>>>>>> #define P_L2_SIZE (1 << P_L2_BITS)
>>>>>>>>>>>> @@ -1861,11 +1861,7 @@ static void memory_map_init(void)
>>>>>>>>>>>> {
>>>>>>>>>>>>    system_memory = g_malloc(sizeof(*system_memory));
>>>>>>>>>>>> 
>>>>>>>>>>>> -    assert(ADDR_SPACE_BITS <= 64);
>>>>>>>>>>>> -
>>>>>>>>>>>> -    memory_region_init(system_memory, NULL, "system",
>>>>>>>>>>>> -                       ADDR_SPACE_BITS == 64 ?
>>>>>>>>>>>> -                       UINT64_MAX : (0x1ULL << ADDR_SPACE_BITS));
>>>>>>>>>>>> +    memory_region_init(system_memory, NULL, "system", UINT64_MAX);
>>>>>>>>>>>>    address_space_init(&address_space_memory, system_memory, "memory");
>>>>>>>>>>>> 
>>>>>>>>>>>>    system_io = g_malloc(sizeof(*system_io));
>>>>>>>>>>> 
>>>>>>>>>>> This seems to have some unexpected consequences around sizing 64bit PCI
>>>>>>>>>>> BARs that I'm not sure how to handle.
>>>>>>>>>> 
>>>>>>>>>> BARs are often disabled during sizing. Maybe you
>>>>>>>>>> don't detect BAR being disabled?
>>>>>>>>> 
>>>>>>>>> See the trace below, the BARs are not disabled.  QEMU pci-core is doing
>>>>>>>>> the sizing and memory region updates for the BARs; vfio is just a
>>>>>>>>> pass-through here.
>>>>>>>> 
>>>>>>>> Sorry, not in the trace below, but yes the sizing seems to be happening
>>>>>>>> while I/O & memory are enabled in the command register.  Thanks,
>>>>>>>> 
>>>>>>>> Alex
>>>>>>> 
>>>>>>> OK then from QEMU POV this BAR value is not special at all.
>>>>>> 
>>>>>> Unfortunately
>>>>>> 
>>>>>>>>>>> After this patch I get vfio
>>>>>>>>>>> traces like this:
>>>>>>>>>>> 
>>>>>>>>>>> vfio: vfio_pci_read_config(0000:01:10.0, @0x10, len=0x4) febe0004
>>>>>>>>>>> (save lower 32bits of BAR)
>>>>>>>>>>> vfio: vfio_pci_write_config(0000:01:10.0, @0x10, 0xffffffff, len=0x4)
>>>>>>>>>>> (write mask to BAR)
>>>>>>>>>>> vfio: region_del febe0000 - febe3fff
>>>>>>>>>>> (memory region gets unmapped)
>>>>>>>>>>> vfio: vfio_pci_read_config(0000:01:10.0, @0x10, len=0x4) ffffc004
>>>>>>>>>>> (read size mask)
>>>>>>>>>>> vfio: vfio_pci_write_config(0000:01:10.0, @0x10, 0xfebe0004, len=0x4)
>>>>>>>>>>> (restore BAR)
>>>>>>>>>>> vfio: region_add febe0000 - febe3fff [0x7fcf3654d000]
>>>>>>>>>>> (memory region re-mapped)
>>>>>>>>>>> vfio: vfio_pci_read_config(0000:01:10.0, @0x14, len=0x4) 0
>>>>>>>>>>> (save upper 32bits of BAR)
>>>>>>>>>>> vfio: vfio_pci_write_config(0000:01:10.0, @0x14, 0xffffffff, len=0x4)
>>>>>>>>>>> (write mask to BAR)
>>>>>>>>>>> vfio: region_del febe0000 - febe3fff
>>>>>>>>>>> (memory region gets unmapped)
>>>>>>>>>>> vfio: region_add fffffffffebe0000 - fffffffffebe3fff [0x7fcf3654d000]
>>>>>>>>>>> (memory region gets re-mapped with new address)
>>>>>>>>>>> qemu-system-x86_64: vfio_dma_map(0x7fcf38861710, 0xfffffffffebe0000, 0x4000, 0x7fcf3654d000) = -14 (Bad address)
>>>>>>>>>>> (iommu barfs because it can only handle 48bit physical addresses)
>>>>>>>>>> 
>>>>>>>>>> Why are you trying to program BAR addresses for dma in the iommu?
>>>>>>>>> 
>>>>>>>>> Two reasons, first I can't tell the difference between RAM and MMIO.
>>>>>>> 
>>>>>>> Why can't you? Generally memory core let you find out easily.
>>>>>> 
>>>>>> My MemoryListener is setup for &address_space_memory and I then filter
>>>>>> out anything that's not memory_region_is_ram().  This still gets
>>>>>> through, so how do I easily find out?
>>>>>> 
>>>>>>> But in this case it's vfio device itself that is sized so for sure you
>>>>>>> know it's MMIO.
>>>>>> 
>>>>>> How so?  I have a MemoryListener as described above and pass everything
>>>>>> through to the IOMMU.  I suppose I could look through all the
>>>>>> VFIODevices and check if the MemoryRegion matches, but that seems really
>>>>>> ugly.
>>>>>> 
>>>>>>> Maybe you will have same issue if there's another device with a 64 bit
>>>>>>> bar though, like ivshmem?
>>>>>> 
>>>>>> Perhaps, I suspect I'll see anything that registers their BAR
>>>>>> MemoryRegion from memory_region_init_ram or memory_region_init_ram_ptr.
>>>>> 
>>>>> Must be a 64 bit BAR to trigger the issue though.
>>>>> 
>>>>>>>>> Second, it enables peer-to-peer DMA between devices, which is something
>>>>>>>>> that we might be able to take advantage of with GPU passthrough.
>>>>>>>>> 
>>>>>>>>>>> Prior to this change, there was no re-map with the fffffffffebe0000
>>>>>>>>>>> address, presumably because it was beyond the address space of the PCI
>>>>>>>>>>> window.  This address is clearly not in a PCI MMIO space, so why are we
>>>>>>>>>>> allowing it to be realized in the system address space at this location?
>>>>>>>>>>> Thanks,
>>>>>>>>>>> 
>>>>>>>>>>> Alex
>>>>>>>>>> 
>>>>>>>>>> Why do you think it is not in PCI MMIO space?
>>>>>>>>>> True, CPU can't access this address but other pci devices can.
>>>>>>>>> 
>>>>>>>>> What happens on real hardware when an address like this is programmed to
>>>>>>>>> a device?  The CPU doesn't have the physical bits to access it.  I have
>>>>>>>>> serious doubts that another PCI device would be able to access it
>>>>>>>>> either.  Maybe in some limited scenario where the devices are on the
>>>>>>>>> same conventional PCI bus.  In the typical case, PCI addresses are
>>>>>>>>> always limited by some kind of aperture, whether that's explicit in
>>>>>>>>> bridge windows or implicit in hardware design (and perhaps made explicit
>>>>>>>>> in ACPI).  Even if I wanted to filter these out as noise in vfio, how
>>>>>>>>> would I do it in a way that still allows real 64bit MMIO to be
>>>>>>>>> programmed.  PCI has this knowledge, I hope.  VFIO doesn't.  Thanks,
>>>>>>>>> 
>>>>>>>>> Alex
>>>>>>> 
>>>>>>> AFAIK PCI doesn't have that knowledge as such. PCI spec is explicit that
>>>>>>> full 64 bit addresses must be allowed and hardware validation
>>>>>>> test suites normally check that it actually does work
>>>>>>> if it happens.
>>>>>> 
>>>>>> Sure, PCI devices themselves, but the chipset typically has defined
>>>>>> routing, that's more what I'm referring to.  There are generally only
>>>>>> fixed address windows for RAM vs MMIO.
>>>>> 
>>>>> The physical chipset? Likely - in the presence of IOMMU.
>>>>> Without that, devices can talk to each other without going
>>>>> through chipset, and bridge spec is very explicit that
>>>>> full 64 bit addressing must be supported.
>>>>> 
>>>>> So as long as we don't emulate an IOMMU,
>>>>> guest will normally think it's okay to use any address.
>>>>> 
>>>>>>> Yes, if there's a bridge somewhere on the path that bridge's
>>>>>>> windows would protect you, but pci already does this filtering:
>>>>>>> if you see this address in the memory map this means
>>>>>>> your virtual device is on root bus.
>>>>>>> 
>>>>>>> So I think it's the other way around: if VFIO requires specific
>>>>>>> address ranges to be assigned to devices, it should give this
>>>>>>> info to qemu and qemu can give this to guest.
>>>>>>> Then anything outside that range can be ignored by VFIO.
>>>>>> 
>>>>>> Then we get into deficiencies in the IOMMU API and maybe VFIO.  There's
>>>>>> currently no way to find out the address width of the IOMMU.  We've been
>>>>>> getting by because it's safely close enough to the CPU address width to
>>>>>> not be a concern until we start exposing things at the top of the 64bit
>>>>>> address space.  Maybe I can safely ignore anything above
>>>>>> TARGET_PHYS_ADDR_SPACE_BITS for now.  Thanks,
>>>>>> 
>>>>>> Alex
>>>>> 
>>>>> I think it's not related to target CPU at all - it's a host limitation.
>>>>> So just make up your own constant, maybe depending on host architecture.
>>>>> Long term add an ioctl to query it.
>>>> 
>>>> It's a hardware limitation which I'd imagine has some loose ties to the
>>>> physical address bits of the CPU.
>>>> 
>>>>> Also, we can add a fwcfg interface to tell bios that it should avoid
>>>>> placing BARs above some address.
>>>> 
>>>> That doesn't help this case, it's a spurious mapping caused by sizing
>>>> the BARs with them enabled.  We may still want such a thing to feed into
>>>> building ACPI tables though.
>>> 
>>> Well the point is that if you want BIOS to avoid
>>> specific addresses, you need to tell it what to avoid.
>>> But neither BIOS nor ACPI actually cover the range above
>>> 2^48 ATM so it's not a high priority.
>>> 
>>>>> Since it's a vfio limitation I think it should be a vfio API, along the
>>>>> lines of vfio_get_addr_space_bits(void).
>>>>> (Is this true btw? legacy assignment doesn't have this problem?)
>>>> 
>>>> It's an IOMMU hardware limitation, legacy assignment has the same
>>>> problem.  It looks like legacy will abort() in QEMU for the failed
>>>> mapping and I'm planning to tighten vfio to also kill the VM for failed
>>>> mappings.  In the short term, I think I'll ignore any mappings above
>>>> TARGET_PHYS_ADDR_SPACE_BITS,
>>> 
>>> That seems very wrong. It will still fail on an x86 host if we are
>>> emulating a CPU with full 64 bit addressing. The limitation is on the
>>> host side; there's no real reason to tie it to the target.
> 
> I doubt vfio would be the only thing broken in that case.
> 
>>>> long term vfio already has an IOMMU info
>>>> ioctl that we could use to return this information, but we'll need to
>>>> figure out how to get it out of the IOMMU driver first.
>>>> Thanks,
>>>> 
>>>> Alex
>>> 
>>> Short term, just assume 48 bits on x86.
> 
> I hate to pick an arbitrary value since we have a very specific mapping
> we're trying to avoid.  Perhaps a better option is to skip anything
> where:
> 
>        MemoryRegionSection.offset_within_address_space >
>        ~MemoryRegionSection.offset_within_address_space
> 
>>> We need to figure out what's the limitation on ppc and arm -
>>> maybe there's none and it can address full 64 bit range.
>> 
>> IIUC on PPC and ARM you always have BAR windows where things can get mapped into, unlike x86 where the full physical address range can be overlaid by BARs.
>> 
>> Or did I misunderstand the question?
> 
> Sounds right, if either BAR mappings outside the window will not be
> realized in the memory space or the IOMMU has a full 64bit address
> space, there's no problem.  Here we have an intermediate step in the BAR
> sizing producing a stray mapping that the IOMMU hardware can't handle.
> Even if we could handle it, it's not clear that we want to.  On AMD-Vi
> the IOMMU page tables can grow to 6 levels deep.  A stray mapping like
> this then causes space and time overhead until the tables are pruned
> back down.  Thanks,

I thought sizing is hard-defined as writing -1? Can't we check for that one special case and treat it as "not mapped, but tell the guest the size in config space"?

Alex

> 
> Alex
>
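
One way to read the sizing suggestion above, strictly as a sketch and not something pci-core or vfio implements, is to treat an all-ones upper dword as "sizing in progress" and leave the transient region unmapped:

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    /*
     * Hypothetical check: if the upper 32 bits of a 64-bit BAR are all ones,
     * assume the guest is sizing the BAR and skip the IOMMU mapping for the
     * transient address.  A real implementation would have to show that such
     * a value can never be a legitimate BAR address on the platform.
     */
    static bool bar_sizing_in_progress(uint64_t bar_addr)
    {
        return (bar_addr >> 32) == 0xffffffffu;
    }

    int main(void)
    {
        printf("%d\n", bar_sizing_in_progress(0xfffffffffebe0000ULL));  /* 1 */
        printf("%d\n", bar_sizing_in_progress(0x0000000100000000ULL));  /* 0 */
        return 0;
    }
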
Alex Williamson Jan. 13, 2014, 10:48 p.m. UTC | #13
On Mon, 2014-01-13 at 22:48 +0100, Alexander Graf wrote:
> 
> > Am 13.01.2014 um 22:39 schrieb Alex Williamson <alex.williamson@redhat.com>:
> > 
> >> On Sun, 2014-01-12 at 16:03 +0100, Alexander Graf wrote:
> >>> On 12.01.2014, at 08:54, Michael S. Tsirkin <mst@redhat.com> wrote:
> >>> 
> >>>> On Fri, Jan 10, 2014 at 08:31:36AM -0700, Alex Williamson wrote:
> >>>>> On Fri, 2014-01-10 at 14:55 +0200, Michael S. Tsirkin wrote:
> >>>>>> On Thu, Jan 09, 2014 at 03:42:22PM -0700, Alex Williamson wrote:
> >>>>>>> On Thu, 2014-01-09 at 23:56 +0200, Michael S. Tsirkin wrote:
> >>>>>>>> On Thu, Jan 09, 2014 at 12:03:26PM -0700, Alex Williamson wrote:
> >>>>>>>>> On Thu, 2014-01-09 at 11:47 -0700, Alex Williamson wrote:
> >>>>>>>>>> On Thu, 2014-01-09 at 20:00 +0200, Michael S. Tsirkin wrote:
> >>>>>>>>>>> On Thu, Jan 09, 2014 at 10:24:47AM -0700, Alex Williamson wrote:
> >>>>>>>>>>>> On Wed, 2013-12-11 at 20:30 +0200, Michael S. Tsirkin wrote:
> >>>>>>>>>>>> From: Paolo Bonzini <pbonzini@redhat.com>
> >>>>>>>>>>>> 
> >>>>>>>>>>>> As an alternative to commit 818f86b (exec: limit system memory
> >>>>>>>>>>>> size, 2013-11-04) let's just make all address spaces 64-bit wide.
> >>>>>>>>>>>> This eliminates problems with phys_page_find ignoring bits above
> >>>>>>>>>>>> TARGET_PHYS_ADDR_SPACE_BITS and address_space_translate_internal
> >>>>>>>>>>>> consequently messing up the computations.
> >>>>>>>>>>>> 
> >>>>>>>>>>>> In Luiz's reported crash, at startup gdb attempts to read from address
> >>>>>>>>>>>> 0xffffffffffffffe6 to 0xffffffffffffffff inclusive.  The region it gets
> >>>>>>>>>>>> is the newly introduced master abort region, which is as big as the PCI
> >>>>>>>>>>>> address space (see pci_bus_init).  Due to a typo that's only 2^63-1,
> >>>>>>>>>>>> not 2^64.  But we get it anyway because phys_page_find ignores the upper
> >>>>>>>>>>>> bits of the physical address.  In address_space_translate_internal then
> >>>>>>>>>>>> 
> >>>>>>>>>>>>   diff = int128_sub(section->mr->size, int128_make64(addr));
> >>>>>>>>>>>>   *plen = int128_get64(int128_min(diff, int128_make64(*plen)));
> >>>>>>>>>>>> 
> >>>>>>>>>>>> diff becomes negative, and int128_get64 booms.
> >>>>>>>>>>>> 
> >>>>>>>>>>>> The size of the PCI address space region should be fixed anyway.
> >>>>>>>>>>>> 
> >>>>>>>>>>>> Reported-by: Luiz Capitulino <lcapitulino@redhat.com>
> >>>>>>>>>>>> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> >>>>>>>>>>>> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> >>>>>>>>>>>> ---
> >>>>>>>>>>>> exec.c | 8 ++------
> >>>>>>>>>>>> 1 file changed, 2 insertions(+), 6 deletions(-)
> >>>>>>>>>>>> 
> >>>>>>>>>>>> diff --git a/exec.c b/exec.c
> >>>>>>>>>>>> index 7e5ce93..f907f5f 100644
> >>>>>>>>>>>> --- a/exec.c
> >>>>>>>>>>>> +++ b/exec.c
> >>>>>>>>>>>> @@ -94,7 +94,7 @@ struct PhysPageEntry {
> >>>>>>>>>>>> #define PHYS_MAP_NODE_NIL (((uint32_t)~0) >> 6)
> >>>>>>>>>>>> 
> >>>>>>>>>>>> /* Size of the L2 (and L3, etc) page tables.  */
> >>>>>>>>>>>> -#define ADDR_SPACE_BITS TARGET_PHYS_ADDR_SPACE_BITS
> >>>>>>>>>>>> +#define ADDR_SPACE_BITS 64
> >>>>>>>>>>>> 
> >>>>>>>>>>>> #define P_L2_BITS 10
> >>>>>>>>>>>> #define P_L2_SIZE (1 << P_L2_BITS)
> >>>>>>>>>>>> @@ -1861,11 +1861,7 @@ static void memory_map_init(void)
> >>>>>>>>>>>> {
> >>>>>>>>>>>>    system_memory = g_malloc(sizeof(*system_memory));
> >>>>>>>>>>>> 
> >>>>>>>>>>>> -    assert(ADDR_SPACE_BITS <= 64);
> >>>>>>>>>>>> -
> >>>>>>>>>>>> -    memory_region_init(system_memory, NULL, "system",
> >>>>>>>>>>>> -                       ADDR_SPACE_BITS == 64 ?
> >>>>>>>>>>>> -                       UINT64_MAX : (0x1ULL << ADDR_SPACE_BITS));
> >>>>>>>>>>>> +    memory_region_init(system_memory, NULL, "system", UINT64_MAX);
> >>>>>>>>>>>>    address_space_init(&address_space_memory, system_memory, "memory");
> >>>>>>>>>>>> 
> >>>>>>>>>>>>    system_io = g_malloc(sizeof(*system_io));
> >>>>>>>>>>> 
> >>>>>>>>>>> This seems to have some unexpected consequences around sizing 64bit PCI
> >>>>>>>>>>> BARs that I'm not sure how to handle.
> >>>>>>>>>> 
> >>>>>>>>>> BARs are often disabled during sizing. Maybe you
> >>>>>>>>>> don't detect BAR being disabled?
> >>>>>>>>> 
> >>>>>>>>> See the trace below, the BARs are not disabled.  QEMU pci-core is doing
> >>>>>>>>> the sizing an memory region updates for the BARs, vfio is just a
> >>>>>>>>> pass-through here.
> >>>>>>>> 
> >>>>>>>> Sorry, not in the trace below, but yes the sizing seems to be happening
> >>>>>>>> while I/O & memory are enabled int he command register.  Thanks,
> >>>>>>>> 
> >>>>>>>> Alex
> >>>>>>> 
> >>>>>>> OK then from QEMU POV this BAR value is not special at all.
> >>>>>> 
> >>>>>> Unfortunately
> >>>>>> 
> >>>>>>>>>>> After this patch I get vfio
> >>>>>>>>>>> traces like this:
> >>>>>>>>>>> 
> >>>>>>>>>>> vfio: vfio_pci_read_config(0000:01:10.0, @0x10, len=0x4) febe0004
> >>>>>>>>>>> (save lower 32bits of BAR)
> >>>>>>>>>>> vfio: vfio_pci_write_config(0000:01:10.0, @0x10, 0xffffffff, len=0x4)
> >>>>>>>>>>> (write mask to BAR)
> >>>>>>>>>>> vfio: region_del febe0000 - febe3fff
> >>>>>>>>>>> (memory region gets unmapped)
> >>>>>>>>>>> vfio: vfio_pci_read_config(0000:01:10.0, @0x10, len=0x4) ffffc004
> >>>>>>>>>>> (read size mask)
> >>>>>>>>>>> vfio: vfio_pci_write_config(0000:01:10.0, @0x10, 0xfebe0004, len=0x4)
> >>>>>>>>>>> (restore BAR)
> >>>>>>>>>>> vfio: region_add febe0000 - febe3fff [0x7fcf3654d000]
> >>>>>>>>>>> (memory region re-mapped)
> >>>>>>>>>>> vfio: vfio_pci_read_config(0000:01:10.0, @0x14, len=0x4) 0
> >>>>>>>>>>> (save upper 32bits of BAR)
> >>>>>>>>>>> vfio: vfio_pci_write_config(0000:01:10.0, @0x14, 0xffffffff, len=0x4)
> >>>>>>>>>>> (write mask to BAR)
> >>>>>>>>>>> vfio: region_del febe0000 - febe3fff
> >>>>>>>>>>> (memory region gets unmapped)
> >>>>>>>>>>> vfio: region_add fffffffffebe0000 - fffffffffebe3fff [0x7fcf3654d000]
> >>>>>>>>>>> (memory region gets re-mapped with new address)
> >>>>>>>>>>> qemu-system-x86_64: vfio_dma_map(0x7fcf38861710, 0xfffffffffebe0000, 0x4000, 0x7fcf3654d000) = -14 (Bad address)
> >>>>>>>>>>> (iommu barfs because it can only handle 48bit physical addresses)
> >>>>>>>>>> 
> >>>>>>>>>> Why are you trying to program BAR addresses for dma in the iommu?
> >>>>>>>>> 
> >>>>>>>>> Two reasons, first I can't tell the difference between RAM and MMIO.
> >>>>>>> 
> >>>>>>> Why can't you? Generally memory core let you find out easily.
> >>>>>> 
> >>>>>> My MemoryListener is setup for &address_space_memory and I then filter
> >>>>>> out anything that's not memory_region_is_ram().  This still gets
> >>>>>> through, so how do I easily find out?
> >>>>>> 
> >>>>>>> But in this case it's vfio device itself that is sized so for sure you
> >>>>>>> know it's MMIO.
> >>>>>> 
> >>>>>> How so?  I have a MemoryListener as described above and pass everything
> >>>>>> through to the IOMMU.  I suppose I could look through all the
> >>>>>> VFIODevices and check if the MemoryRegion matches, but that seems really
> >>>>>> ugly.
> >>>>>> 
> >>>>>>> Maybe you will have same issue if there's another device with a 64 bit
> >>>>>>> bar though, like ivshmem?
> >>>>>> 
> >>>>>> Perhaps, I suspect I'll see anything that registers their BAR
> >>>>>> MemoryRegion from memory_region_init_ram or memory_region_init_ram_ptr.
> >>>>> 
> >>>>> Must be a 64 bit BAR to trigger the issue though.
> >>>>> 
> >>>>>>>>> Second, it enables peer-to-peer DMA between devices, which is something
> >>>>>>>>> that we might be able to take advantage of with GPU passthrough.
> >>>>>>>>> 
> >>>>>>>>>>> Prior to this change, there was no re-map with the fffffffffebe0000
> >>>>>>>>>>> address, presumably because it was beyond the address space of the PCI
> >>>>>>>>>>> window.  This address is clearly not in a PCI MMIO space, so why are we
> >>>>>>>>>>> allowing it to be realized in the system address space at this location?
> >>>>>>>>>>> Thanks,
> >>>>>>>>>>> 
> >>>>>>>>>>> Alex
> >>>>>>>>>> 
> >>>>>>>>>> Why do you think it is not in PCI MMIO space?
> >>>>>>>>>> True, CPU can't access this address but other pci devices can.
> >>>>>>>>> 
> >>>>>>>>> What happens on real hardware when an address like this is programmed to
> >>>>>>>>> a device?  The CPU doesn't have the physical bits to access it.  I have
> >>>>>>>>> serious doubts that another PCI device would be able to access it
> >>>>>>>>> either.  Maybe in some limited scenario where the devices are on the
> >>>>>>>>> same conventional PCI bus.  In the typical case, PCI addresses are
> >>>>>>>>> always limited by some kind of aperture, whether that's explicit in
> >>>>>>>>> bridge windows or implicit in hardware design (and perhaps made explicit
> >>>>>>>>> in ACPI).  Even if I wanted to filter these out as noise in vfio, how
> >>>>>>>>> would I do it in a way that still allows real 64bit MMIO to be
> >>>>>>>>> programmed.  PCI has this knowledge, I hope.  VFIO doesn't.  Thanks,
> >>>>>>>>> 
> >>>>>>>>> Alex
> >>>>>>> 
> >>>>>>> AFAIK PCI doesn't have that knowledge as such. PCI spec is explicit that
> >>>>>>> full 64 bit addresses must be allowed and hardware validation
> >>>>>>> test suites normally check that it actually does work
> >>>>>>> if it happens.
> >>>>>> 
> >>>>>> Sure, PCI devices themselves, but the chipset typically has defined
> >>>>>> routing, that's more what I'm referring to.  There are generally only
> >>>>>> fixed address windows for RAM vs MMIO.
> >>>>> 
> >>>>> The physical chipset? Likely - in the presence of IOMMU.
> >>>>> Without that, devices can talk to each other without going
> >>>>> through chipset, and bridge spec is very explicit that
> >>>>> full 64 bit addressing must be supported.
> >>>>> 
> >>>>> So as long as we don't emulate an IOMMU,
> >>>>> guest will normally think it's okay to use any address.
> >>>>> 
> >>>>>>> Yes, if there's a bridge somewhere on the path that bridge's
> >>>>>>> windows would protect you, but pci already does this filtering:
> >>>>>>> if you see this address in the memory map this means
> >>>>>>> your virtual device is on root bus.
> >>>>>>> 
> >>>>>>> So I think it's the other way around: if VFIO requires specific
> >>>>>>> address ranges to be assigned to devices, it should give this
> >>>>>>> info to qemu and qemu can give this to guest.
> >>>>>>> Then anything outside that range can be ignored by VFIO.
> >>>>>> 
> >>>>>> Then we get into deficiencies in the IOMMU API and maybe VFIO.  There's
> >>>>>> currently no way to find out the address width of the IOMMU.  We've been
> >>>>>> getting by because it's safely close enough to the CPU address width to
> >>>>>> not be a concern until we start exposing things at the top of the 64bit
> >>>>>> address space.  Maybe I can safely ignore anything above
> >>>>>> TARGET_PHYS_ADDR_SPACE_BITS for now.  Thanks,
> >>>>>> 
> >>>>>> Alex
> >>>>> 
> >>>>> I think it's not related to target CPU at all - it's a host limitation.
> >>>>> So just make up your own constant, maybe depending on host architecture.
> >>>>> Long term add an ioctl to query it.
> >>>> 
> >>>> It's a hardware limitation which I'd imagine has some loose ties to the
> >>>> physical address bits of the CPU.
> >>>> 
> >>>>> Also, we can add a fwcfg interface to tell bios that it should avoid
> >>>>> placing BARs above some address.
> >>>> 
> >>>> That doesn't help this case, it's a spurious mapping caused by sizing
> >>>> the BARs with them enabled.  We may still want such a thing to feed into
> >>>> building ACPI tables though.
> >>> 
> >>> Well the point is that if you want BIOS to avoid
> >>> specific addresses, you need to tell it what to avoid.
> >>> But neither BIOS nor ACPI actually cover the range above
> >>> 2^48 ATM so it's not a high priority.
> >>> 
> >>>>> Since it's a vfio limitation I think it should be a vfio API, along the
> >>>>> lines of vfio_get_addr_space_bits(void).
> >>>>> (Is this true btw? legacy assignment doesn't have this problem?)
> >>>> 
> >>>> It's an IOMMU hardware limitation, legacy assignment has the same
> >>>> problem.  It looks like legacy will abort() in QEMU for the failed
> >>>> mapping and I'm planning to tighten vfio to also kill the VM for failed
> >>>> mappings.  In the short term, I think I'll ignore any mappings above
> >>>> TARGET_PHYS_ADDR_SPACE_BITS,
> >>> 
> >>> That seems very wrong. It will still fail on an x86 host if we are
> >>> emulating a CPU with full 64 bit addressing. The limitation is on the
> >>> host side there's no real reason to tie it to the target.
> > 
> > I doubt vfio would be the only thing broken in that case.
> > 
> >>>> long term vfio already has an IOMMU info
> >>>> ioctl that we could use to return this information, but we'll need to
> >>>> figure out how to get it out of the IOMMU driver first.
> >>>> Thanks,
> >>>> 
> >>>> Alex
> >>> 
> >>> Short term, just assume 48 bits on x86.
> > 
> > I hate to pick an arbitrary value since we have a very specific mapping
> > we're trying to avoid.  Perhaps a better option is to skip anything
> > where:
> > 
> >        MemoryRegionSection.offset_within_address_space >
> >        ~MemoryRegionSection.offset_within_address_space
> > 
> >>> We need to figure out what's the limitation on ppc and arm -
> >>> maybe there's none and it can address full 64 bit range.
> >> 
> >> IIUC on PPC and ARM you always have BAR windows where things can get mapped into. Unlike x86 where the full phyiscal address range can be overlayed by BARs.
> >> 
> >> Or did I misunderstand the question?
> > 
> > Sounds right, if either BAR mappings outside the window will not be
> > realized in the memory space or the IOMMU has a full 64bit address
> > space, there's no problem.  Here we have an intermediate step in the BAR
> > sizing producing a stray mapping that the IOMMU hardware can't handle.
> > Even if we could handle it, it's not clear that we want to.  On AMD-Vi
> > the IOMMU pages tables can grow to 6-levels deep.  A stray mapping like
> > this then causes space and time overhead until the tables are pruned
> > back down.  Thanks,
> 
> I thought sizing is hard defined as a set to
> -1? Can't we check for that one special case and treat it as "not mapped, but tell the guest the size in config space"?

The QEMU PCI core doesn't want to treat this value as anything special
in order to differentiate a sizing mask from a valid BAR address.  I
agree though, I'd prefer to never see a spurious address like this in
my MemoryListener.
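
For what it's worth, a minimal sketch of the kind of filter discussed
above, assuming QEMU's memory API (exec/memory.h); the helper name and
the "top half of the address space" heuristic are just illustrations,
not anything vfio does today:

    #include "exec/memory.h"

    /* Sketch only: skip sections whose start lies in the top half of the
     * 64 bit address space, i.e. start > ~start.  This catches the
     * fffffffffebe0000 intermediate sizing mapping without hard-coding an
     * arbitrary host IOMMU width. */
    static bool vfio_listener_skip_section(MemoryRegionSection *section)
    {
        hwaddr start = section->offset_within_address_space;

        if (!memory_region_is_ram(section->mr)) {
            return true;   /* not RAM-backed; mmap'd BARs do look like RAM */
        }
        return start > ~start;
    }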
Michael S. Tsirkin Jan. 14, 2014, 8:18 a.m. UTC | #14
On Mon, Jan 13, 2014 at 10:48:21PM +0100, Alexander Graf wrote:
> 
> 
> > Am 13.01.2014 um 22:39 schrieb Alex Williamson <alex.williamson@redhat.com>:
> > 
> >> On Sun, 2014-01-12 at 16:03 +0100, Alexander Graf wrote:
> >>> On 12.01.2014, at 08:54, Michael S. Tsirkin <mst@redhat.com> wrote:
> >>> 
> >>>> On Fri, Jan 10, 2014 at 08:31:36AM -0700, Alex Williamson wrote:
> >>>>> On Fri, 2014-01-10 at 14:55 +0200, Michael S. Tsirkin wrote:
> >>>>>> On Thu, Jan 09, 2014 at 03:42:22PM -0700, Alex Williamson wrote:
> >>>>>>> On Thu, 2014-01-09 at 23:56 +0200, Michael S. Tsirkin wrote:
> >>>>>>>> On Thu, Jan 09, 2014 at 12:03:26PM -0700, Alex Williamson wrote:
> >>>>>>>>> On Thu, 2014-01-09 at 11:47 -0700, Alex Williamson wrote:
> >>>>>>>>>> On Thu, 2014-01-09 at 20:00 +0200, Michael S. Tsirkin wrote:
> >>>>>>>>>>> On Thu, Jan 09, 2014 at 10:24:47AM -0700, Alex Williamson wrote:
> >>>>>>>>>>>> On Wed, 2013-12-11 at 20:30 +0200, Michael S. Tsirkin wrote:
> >>>>>>>>>>>> From: Paolo Bonzini <pbonzini@redhat.com>
> >>>>>>>>>>>> 
> >>>>>>>>>>>> As an alternative to commit 818f86b (exec: limit system memory
> >>>>>>>>>>>> size, 2013-11-04) let's just make all address spaces 64-bit wide.
> >>>>>>>>>>>> This eliminates problems with phys_page_find ignoring bits above
> >>>>>>>>>>>> TARGET_PHYS_ADDR_SPACE_BITS and address_space_translate_internal
> >>>>>>>>>>>> consequently messing up the computations.
> >>>>>>>>>>>> 
> >>>>>>>>>>>> In Luiz's reported crash, at startup gdb attempts to read from address
> >>>>>>>>>>>> 0xffffffffffffffe6 to 0xffffffffffffffff inclusive.  The region it gets
> >>>>>>>>>>>> is the newly introduced master abort region, which is as big as the PCI
> >>>>>>>>>>>> address space (see pci_bus_init).  Due to a typo that's only 2^63-1,
> >>>>>>>>>>>> not 2^64.  But we get it anyway because phys_page_find ignores the upper
> >>>>>>>>>>>> bits of the physical address.  In address_space_translate_internal then
> >>>>>>>>>>>> 
> >>>>>>>>>>>>   diff = int128_sub(section->mr->size, int128_make64(addr));
> >>>>>>>>>>>>   *plen = int128_get64(int128_min(diff, int128_make64(*plen)));
> >>>>>>>>>>>> 
> >>>>>>>>>>>> diff becomes negative, and int128_get64 booms.
> >>>>>>>>>>>> 
> >>>>>>>>>>>> The size of the PCI address space region should be fixed anyway.
> >>>>>>>>>>>> 
> >>>>>>>>>>>> Reported-by: Luiz Capitulino <lcapitulino@redhat.com>
> >>>>>>>>>>>> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> >>>>>>>>>>>> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> >>>>>>>>>>>> ---
> >>>>>>>>>>>> exec.c | 8 ++------
> >>>>>>>>>>>> 1 file changed, 2 insertions(+), 6 deletions(-)
> >>>>>>>>>>>> 
> >>>>>>>>>>>> diff --git a/exec.c b/exec.c
> >>>>>>>>>>>> index 7e5ce93..f907f5f 100644
> >>>>>>>>>>>> --- a/exec.c
> >>>>>>>>>>>> +++ b/exec.c
> >>>>>>>>>>>> @@ -94,7 +94,7 @@ struct PhysPageEntry {
> >>>>>>>>>>>> #define PHYS_MAP_NODE_NIL (((uint32_t)~0) >> 6)
> >>>>>>>>>>>> 
> >>>>>>>>>>>> /* Size of the L2 (and L3, etc) page tables.  */
> >>>>>>>>>>>> -#define ADDR_SPACE_BITS TARGET_PHYS_ADDR_SPACE_BITS
> >>>>>>>>>>>> +#define ADDR_SPACE_BITS 64
> >>>>>>>>>>>> 
> >>>>>>>>>>>> #define P_L2_BITS 10
> >>>>>>>>>>>> #define P_L2_SIZE (1 << P_L2_BITS)
> >>>>>>>>>>>> @@ -1861,11 +1861,7 @@ static void memory_map_init(void)
> >>>>>>>>>>>> {
> >>>>>>>>>>>>    system_memory = g_malloc(sizeof(*system_memory));
> >>>>>>>>>>>> 
> >>>>>>>>>>>> -    assert(ADDR_SPACE_BITS <= 64);
> >>>>>>>>>>>> -
> >>>>>>>>>>>> -    memory_region_init(system_memory, NULL, "system",
> >>>>>>>>>>>> -                       ADDR_SPACE_BITS == 64 ?
> >>>>>>>>>>>> -                       UINT64_MAX : (0x1ULL << ADDR_SPACE_BITS));
> >>>>>>>>>>>> +    memory_region_init(system_memory, NULL, "system", UINT64_MAX);
> >>>>>>>>>>>>    address_space_init(&address_space_memory, system_memory, "memory");
> >>>>>>>>>>>> 
> >>>>>>>>>>>>    system_io = g_malloc(sizeof(*system_io));
> >>>>>>>>>>> 
> >>>>>>>>>>> This seems to have some unexpected consequences around sizing 64bit PCI
> >>>>>>>>>>> BARs that I'm not sure how to handle.
> >>>>>>>>>> 
> >>>>>>>>>> BARs are often disabled during sizing. Maybe you
> >>>>>>>>>> don't detect BAR being disabled?
> >>>>>>>>> 
> >>>>>>>>> See the trace below, the BARs are not disabled.  QEMU pci-core is doing
> >>>>>>>>> the sizing an memory region updates for the BARs, vfio is just a
> >>>>>>>>> pass-through here.
> >>>>>>>> 
> >>>>>>>> Sorry, not in the trace below, but yes the sizing seems to be happening
> >>>>>>>> while I/O & memory are enabled int he command register.  Thanks,
> >>>>>>>> 
> >>>>>>>> Alex
> >>>>>>> 
> >>>>>>> OK then from QEMU POV this BAR value is not special at all.
> >>>>>> 
> >>>>>> Unfortunately
> >>>>>> 
> >>>>>>>>>>> After this patch I get vfio
> >>>>>>>>>>> traces like this:
> >>>>>>>>>>> 
> >>>>>>>>>>> vfio: vfio_pci_read_config(0000:01:10.0, @0x10, len=0x4) febe0004
> >>>>>>>>>>> (save lower 32bits of BAR)
> >>>>>>>>>>> vfio: vfio_pci_write_config(0000:01:10.0, @0x10, 0xffffffff, len=0x4)
> >>>>>>>>>>> (write mask to BAR)
> >>>>>>>>>>> vfio: region_del febe0000 - febe3fff
> >>>>>>>>>>> (memory region gets unmapped)
> >>>>>>>>>>> vfio: vfio_pci_read_config(0000:01:10.0, @0x10, len=0x4) ffffc004
> >>>>>>>>>>> (read size mask)
> >>>>>>>>>>> vfio: vfio_pci_write_config(0000:01:10.0, @0x10, 0xfebe0004, len=0x4)
> >>>>>>>>>>> (restore BAR)
> >>>>>>>>>>> vfio: region_add febe0000 - febe3fff [0x7fcf3654d000]
> >>>>>>>>>>> (memory region re-mapped)
> >>>>>>>>>>> vfio: vfio_pci_read_config(0000:01:10.0, @0x14, len=0x4) 0
> >>>>>>>>>>> (save upper 32bits of BAR)
> >>>>>>>>>>> vfio: vfio_pci_write_config(0000:01:10.0, @0x14, 0xffffffff, len=0x4)
> >>>>>>>>>>> (write mask to BAR)
> >>>>>>>>>>> vfio: region_del febe0000 - febe3fff
> >>>>>>>>>>> (memory region gets unmapped)
> >>>>>>>>>>> vfio: region_add fffffffffebe0000 - fffffffffebe3fff [0x7fcf3654d000]
> >>>>>>>>>>> (memory region gets re-mapped with new address)
> >>>>>>>>>>> qemu-system-x86_64: vfio_dma_map(0x7fcf38861710, 0xfffffffffebe0000, 0x4000, 0x7fcf3654d000) = -14 (Bad address)
> >>>>>>>>>>> (iommu barfs because it can only handle 48bit physical addresses)
> >>>>>>>>>> 
> >>>>>>>>>> Why are you trying to program BAR addresses for dma in the iommu?
> >>>>>>>>> 
> >>>>>>>>> Two reasons, first I can't tell the difference between RAM and MMIO.
> >>>>>>> 
> >>>>>>> Why can't you? Generally memory core let you find out easily.
> >>>>>> 
> >>>>>> My MemoryListener is setup for &address_space_memory and I then filter
> >>>>>> out anything that's not memory_region_is_ram().  This still gets
> >>>>>> through, so how do I easily find out?
> >>>>>> 
> >>>>>>> But in this case it's vfio device itself that is sized so for sure you
> >>>>>>> know it's MMIO.
> >>>>>> 
> >>>>>> How so?  I have a MemoryListener as described above and pass everything
> >>>>>> through to the IOMMU.  I suppose I could look through all the
> >>>>>> VFIODevices and check if the MemoryRegion matches, but that seems really
> >>>>>> ugly.
> >>>>>> 
> >>>>>>> Maybe you will have same issue if there's another device with a 64 bit
> >>>>>>> bar though, like ivshmem?
> >>>>>> 
> >>>>>> Perhaps, I suspect I'll see anything that registers their BAR
> >>>>>> MemoryRegion from memory_region_init_ram or memory_region_init_ram_ptr.
> >>>>> 
> >>>>> Must be a 64 bit BAR to trigger the issue though.
> >>>>> 
> >>>>>>>>> Second, it enables peer-to-peer DMA between devices, which is something
> >>>>>>>>> that we might be able to take advantage of with GPU passthrough.
> >>>>>>>>> 
> >>>>>>>>>>> Prior to this change, there was no re-map with the fffffffffebe0000
> >>>>>>>>>>> address, presumably because it was beyond the address space of the PCI
> >>>>>>>>>>> window.  This address is clearly not in a PCI MMIO space, so why are we
> >>>>>>>>>>> allowing it to be realized in the system address space at this location?
> >>>>>>>>>>> Thanks,
> >>>>>>>>>>> 
> >>>>>>>>>>> Alex
> >>>>>>>>>> 
> >>>>>>>>>> Why do you think it is not in PCI MMIO space?
> >>>>>>>>>> True, CPU can't access this address but other pci devices can.
> >>>>>>>>> 
> >>>>>>>>> What happens on real hardware when an address like this is programmed to
> >>>>>>>>> a device?  The CPU doesn't have the physical bits to access it.  I have
> >>>>>>>>> serious doubts that another PCI device would be able to access it
> >>>>>>>>> either.  Maybe in some limited scenario where the devices are on the
> >>>>>>>>> same conventional PCI bus.  In the typical case, PCI addresses are
> >>>>>>>>> always limited by some kind of aperture, whether that's explicit in
> >>>>>>>>> bridge windows or implicit in hardware design (and perhaps made explicit
> >>>>>>>>> in ACPI).  Even if I wanted to filter these out as noise in vfio, how
> >>>>>>>>> would I do it in a way that still allows real 64bit MMIO to be
> >>>>>>>>> programmed.  PCI has this knowledge, I hope.  VFIO doesn't.  Thanks,
> >>>>>>>>> 
> >>>>>>>>> Alex
> >>>>>>> 
> >>>>>>> AFAIK PCI doesn't have that knowledge as such. PCI spec is explicit that
> >>>>>>> full 64 bit addresses must be allowed and hardware validation
> >>>>>>> test suites normally check that it actually does work
> >>>>>>> if it happens.
> >>>>>> 
> >>>>>> Sure, PCI devices themselves, but the chipset typically has defined
> >>>>>> routing, that's more what I'm referring to.  There are generally only
> >>>>>> fixed address windows for RAM vs MMIO.
> >>>>> 
> >>>>> The physical chipset? Likely - in the presence of IOMMU.
> >>>>> Without that, devices can talk to each other without going
> >>>>> through chipset, and bridge spec is very explicit that
> >>>>> full 64 bit addressing must be supported.
> >>>>> 
> >>>>> So as long as we don't emulate an IOMMU,
> >>>>> guest will normally think it's okay to use any address.
> >>>>> 
> >>>>>>> Yes, if there's a bridge somewhere on the path that bridge's
> >>>>>>> windows would protect you, but pci already does this filtering:
> >>>>>>> if you see this address in the memory map this means
> >>>>>>> your virtual device is on root bus.
> >>>>>>> 
> >>>>>>> So I think it's the other way around: if VFIO requires specific
> >>>>>>> address ranges to be assigned to devices, it should give this
> >>>>>>> info to qemu and qemu can give this to guest.
> >>>>>>> Then anything outside that range can be ignored by VFIO.
> >>>>>> 
> >>>>>> Then we get into deficiencies in the IOMMU API and maybe VFIO.  There's
> >>>>>> currently no way to find out the address width of the IOMMU.  We've been
> >>>>>> getting by because it's safely close enough to the CPU address width to
> >>>>>> not be a concern until we start exposing things at the top of the 64bit
> >>>>>> address space.  Maybe I can safely ignore anything above
> >>>>>> TARGET_PHYS_ADDR_SPACE_BITS for now.  Thanks,
> >>>>>> 
> >>>>>> Alex
> >>>>> 
> >>>>> I think it's not related to target CPU at all - it's a host limitation.
> >>>>> So just make up your own constant, maybe depending on host architecture.
> >>>>> Long term add an ioctl to query it.
> >>>> 
> >>>> It's a hardware limitation which I'd imagine has some loose ties to the
> >>>> physical address bits of the CPU.
> >>>> 
> >>>>> Also, we can add a fwcfg interface to tell bios that it should avoid
> >>>>> placing BARs above some address.
> >>>> 
> >>>> That doesn't help this case, it's a spurious mapping caused by sizing
> >>>> the BARs with them enabled.  We may still want such a thing to feed into
> >>>> building ACPI tables though.
> >>> 
> >>> Well the point is that if you want BIOS to avoid
> >>> specific addresses, you need to tell it what to avoid.
> >>> But neither BIOS nor ACPI actually cover the range above
> >>> 2^48 ATM so it's not a high priority.
> >>> 
> >>>>> Since it's a vfio limitation I think it should be a vfio API, along the
> >>>>> lines of vfio_get_addr_space_bits(void).
> >>>>> (Is this true btw? legacy assignment doesn't have this problem?)
> >>>> 
> >>>> It's an IOMMU hardware limitation, legacy assignment has the same
> >>>> problem.  It looks like legacy will abort() in QEMU for the failed
> >>>> mapping and I'm planning to tighten vfio to also kill the VM for failed
> >>>> mappings.  In the short term, I think I'll ignore any mappings above
> >>>> TARGET_PHYS_ADDR_SPACE_BITS,
> >>> 
> >>> That seems very wrong. It will still fail on an x86 host if we are
> >>> emulating a CPU with full 64 bit addressing. The limitation is on the
> >>> host side there's no real reason to tie it to the target.
> > 
> > I doubt vfio would be the only thing broken in that case.
> > 
> >>>> long term vfio already has an IOMMU info
> >>>> ioctl that we could use to return this information, but we'll need to
> >>>> figure out how to get it out of the IOMMU driver first.
> >>>> Thanks,
> >>>> 
> >>>> Alex
> >>> 
> >>> Short term, just assume 48 bits on x86.
> > 
> > I hate to pick an arbitrary value since we have a very specific mapping
> > we're trying to avoid.  Perhaps a better option is to skip anything
> > where:
> > 
> >        MemoryRegionSection.offset_within_address_space >
> >        ~MemoryRegionSection.offset_within_address_space
> > 
> >>> We need to figure out what's the limitation on ppc and arm -
> >>> maybe there's none and it can address full 64 bit range.
> >> 
> >> IIUC on PPC and ARM you always have BAR windows where things can get mapped into. Unlike x86 where the full phyiscal address range can be overlayed by BARs.
> >> 
> >> Or did I misunderstand the question?
> > 
> > Sounds right, if either BAR mappings outside the window will not be
> > realized in the memory space or the IOMMU has a full 64bit address
> > space, there's no problem.  Here we have an intermediate step in the BAR
> > sizing producing a stray mapping that the IOMMU hardware can't handle.
> > Even if we could handle it, it's not clear that we want to.  On AMD-Vi
> > the IOMMU pages tables can grow to 6-levels deep.  A stray mapping like
> > this then causes space and time overhead until the tables are pruned
> > back down.  Thanks,
> 
> I thought sizing is hard defined as a set to
> -1? Can't we check for that one special case and treat it as "not mapped, but tell the guest the size in config space"?
> 
> Alex

We already have a work-around like this, and it works for 32 bit BARs
or after software has written the full 64 bit register:
    if (last_addr <= new_addr || new_addr == 0 ||
        last_addr == PCI_BAR_UNMAPPED) {
        return PCI_BAR_UNMAPPED;
    }

    if (!(type & PCI_BASE_ADDRESS_MEM_TYPE_64) && last_addr >= UINT32_MAX) {
        return PCI_BAR_UNMAPPED;
    }


But for 64 bit BARs the software writes all 1's to the high 32 bit
register before writing the low register (see trace above).
This makes it impossible to distinguish between setting a BAR at
fffffffffebe0000 and this intermediate sizing step.
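
To make that intermediate state concrete, a minimal sketch of the
arithmetic using the addresses from the trace above (this is not QEMU
code, it just reproduces the value the memory core ends up with):

    #include <inttypes.h>
    #include <stdio.h>

    int main(void)
    {
        uint32_t lo = 0xfebe0000;  /* programmed base, low flag bits masked */
        uint32_t hi = 0xffffffff;  /* sizing write to the high dword */
        uint64_t addr = ((uint64_t)hi << 32) | lo;

        /* prints fffffffffebe0000 - indistinguishable from a real BAR */
        printf("%" PRIx64 "\n", addr);
        return 0;
    }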


> > 
> > Alex
> >
Alexander Graf Jan. 14, 2014, 9:20 a.m. UTC | #15
On 14.01.2014, at 09:18, Michael S. Tsirkin <mst@redhat.com> wrote:

> On Mon, Jan 13, 2014 at 10:48:21PM +0100, Alexander Graf wrote:
>> 
>> 
>>> Am 13.01.2014 um 22:39 schrieb Alex Williamson <alex.williamson@redhat.com>:
>>> 
>>>> On Sun, 2014-01-12 at 16:03 +0100, Alexander Graf wrote:
>>>>> On 12.01.2014, at 08:54, Michael S. Tsirkin <mst@redhat.com> wrote:
>>>>> 
>>>>>> On Fri, Jan 10, 2014 at 08:31:36AM -0700, Alex Williamson wrote:
>>>>>>> On Fri, 2014-01-10 at 14:55 +0200, Michael S. Tsirkin wrote:
>>>>>>>> On Thu, Jan 09, 2014 at 03:42:22PM -0700, Alex Williamson wrote:
>>>>>>>>> On Thu, 2014-01-09 at 23:56 +0200, Michael S. Tsirkin wrote:
>>>>>>>>>> On Thu, Jan 09, 2014 at 12:03:26PM -0700, Alex Williamson wrote:
>>>>>>>>>>> On Thu, 2014-01-09 at 11:47 -0700, Alex Williamson wrote:
>>>>>>>>>>>> On Thu, 2014-01-09 at 20:00 +0200, Michael S. Tsirkin wrote:
>>>>>>>>>>>>> On Thu, Jan 09, 2014 at 10:24:47AM -0700, Alex Williamson wrote:
>>>>>>>>>>>>>> On Wed, 2013-12-11 at 20:30 +0200, Michael S. Tsirkin wrote:
>>>>>>>>>>>>>> From: Paolo Bonzini <pbonzini@redhat.com>
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> As an alternative to commit 818f86b (exec: limit system memory
>>>>>>>>>>>>>> size, 2013-11-04) let's just make all address spaces 64-bit wide.
>>>>>>>>>>>>>> This eliminates problems with phys_page_find ignoring bits above
>>>>>>>>>>>>>> TARGET_PHYS_ADDR_SPACE_BITS and address_space_translate_internal
>>>>>>>>>>>>>> consequently messing up the computations.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> In Luiz's reported crash, at startup gdb attempts to read from address
>>>>>>>>>>>>>> 0xffffffffffffffe6 to 0xffffffffffffffff inclusive.  The region it gets
>>>>>>>>>>>>>> is the newly introduced master abort region, which is as big as the PCI
>>>>>>>>>>>>>> address space (see pci_bus_init).  Due to a typo that's only 2^63-1,
>>>>>>>>>>>>>> not 2^64.  But we get it anyway because phys_page_find ignores the upper
>>>>>>>>>>>>>> bits of the physical address.  In address_space_translate_internal then
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>  diff = int128_sub(section->mr->size, int128_make64(addr));
>>>>>>>>>>>>>>  *plen = int128_get64(int128_min(diff, int128_make64(*plen)));
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> diff becomes negative, and int128_get64 booms.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> The size of the PCI address space region should be fixed anyway.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Reported-by: Luiz Capitulino <lcapitulino@redhat.com>
>>>>>>>>>>>>>> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
>>>>>>>>>>>>>> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
>>>>>>>>>>>>>> ---
>>>>>>>>>>>>>> exec.c | 8 ++------
>>>>>>>>>>>>>> 1 file changed, 2 insertions(+), 6 deletions(-)
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> diff --git a/exec.c b/exec.c
>>>>>>>>>>>>>> index 7e5ce93..f907f5f 100644
>>>>>>>>>>>>>> --- a/exec.c
>>>>>>>>>>>>>> +++ b/exec.c
>>>>>>>>>>>>>> @@ -94,7 +94,7 @@ struct PhysPageEntry {
>>>>>>>>>>>>>> #define PHYS_MAP_NODE_NIL (((uint32_t)~0) >> 6)
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> /* Size of the L2 (and L3, etc) page tables.  */
>>>>>>>>>>>>>> -#define ADDR_SPACE_BITS TARGET_PHYS_ADDR_SPACE_BITS
>>>>>>>>>>>>>> +#define ADDR_SPACE_BITS 64
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> #define P_L2_BITS 10
>>>>>>>>>>>>>> #define P_L2_SIZE (1 << P_L2_BITS)
>>>>>>>>>>>>>> @@ -1861,11 +1861,7 @@ static void memory_map_init(void)
>>>>>>>>>>>>>> {
>>>>>>>>>>>>>>   system_memory = g_malloc(sizeof(*system_memory));
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> -    assert(ADDR_SPACE_BITS <= 64);
>>>>>>>>>>>>>> -
>>>>>>>>>>>>>> -    memory_region_init(system_memory, NULL, "system",
>>>>>>>>>>>>>> -                       ADDR_SPACE_BITS == 64 ?
>>>>>>>>>>>>>> -                       UINT64_MAX : (0x1ULL << ADDR_SPACE_BITS));
>>>>>>>>>>>>>> +    memory_region_init(system_memory, NULL, "system", UINT64_MAX);
>>>>>>>>>>>>>>   address_space_init(&address_space_memory, system_memory, "memory");
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>   system_io = g_malloc(sizeof(*system_io));
>>>>>>>>>>>>> 
>>>>>>>>>>>>> This seems to have some unexpected consequences around sizing 64bit PCI
>>>>>>>>>>>>> BARs that I'm not sure how to handle.
>>>>>>>>>>>> 
>>>>>>>>>>>> BARs are often disabled during sizing. Maybe you
>>>>>>>>>>>> don't detect BAR being disabled?
>>>>>>>>>>> 
>>>>>>>>>>> See the trace below, the BARs are not disabled.  QEMU pci-core is doing
>>>>>>>>>>> the sizing an memory region updates for the BARs, vfio is just a
>>>>>>>>>>> pass-through here.
>>>>>>>>>> 
>>>>>>>>>> Sorry, not in the trace below, but yes the sizing seems to be happening
>>>>>>>>>> while I/O & memory are enabled int he command register.  Thanks,
>>>>>>>>>> 
>>>>>>>>>> Alex
>>>>>>>>> 
>>>>>>>>> OK then from QEMU POV this BAR value is not special at all.
>>>>>>>> 
>>>>>>>> Unfortunately
>>>>>>>> 
>>>>>>>>>>>>> After this patch I get vfio
>>>>>>>>>>>>> traces like this:
>>>>>>>>>>>>> 
>>>>>>>>>>>>> vfio: vfio_pci_read_config(0000:01:10.0, @0x10, len=0x4) febe0004
>>>>>>>>>>>>> (save lower 32bits of BAR)
>>>>>>>>>>>>> vfio: vfio_pci_write_config(0000:01:10.0, @0x10, 0xffffffff, len=0x4)
>>>>>>>>>>>>> (write mask to BAR)
>>>>>>>>>>>>> vfio: region_del febe0000 - febe3fff
>>>>>>>>>>>>> (memory region gets unmapped)
>>>>>>>>>>>>> vfio: vfio_pci_read_config(0000:01:10.0, @0x10, len=0x4) ffffc004
>>>>>>>>>>>>> (read size mask)
>>>>>>>>>>>>> vfio: vfio_pci_write_config(0000:01:10.0, @0x10, 0xfebe0004, len=0x4)
>>>>>>>>>>>>> (restore BAR)
>>>>>>>>>>>>> vfio: region_add febe0000 - febe3fff [0x7fcf3654d000]
>>>>>>>>>>>>> (memory region re-mapped)
>>>>>>>>>>>>> vfio: vfio_pci_read_config(0000:01:10.0, @0x14, len=0x4) 0
>>>>>>>>>>>>> (save upper 32bits of BAR)
>>>>>>>>>>>>> vfio: vfio_pci_write_config(0000:01:10.0, @0x14, 0xffffffff, len=0x4)
>>>>>>>>>>>>> (write mask to BAR)
>>>>>>>>>>>>> vfio: region_del febe0000 - febe3fff
>>>>>>>>>>>>> (memory region gets unmapped)
>>>>>>>>>>>>> vfio: region_add fffffffffebe0000 - fffffffffebe3fff [0x7fcf3654d000]
>>>>>>>>>>>>> (memory region gets re-mapped with new address)
>>>>>>>>>>>>> qemu-system-x86_64: vfio_dma_map(0x7fcf38861710, 0xfffffffffebe0000, 0x4000, 0x7fcf3654d000) = -14 (Bad address)
>>>>>>>>>>>>> (iommu barfs because it can only handle 48bit physical addresses)
>>>>>>>>>>>> 
>>>>>>>>>>>> Why are you trying to program BAR addresses for dma in the iommu?
>>>>>>>>>>> 
>>>>>>>>>>> Two reasons, first I can't tell the difference between RAM and MMIO.
>>>>>>>>> 
>>>>>>>>> Why can't you? Generally memory core let you find out easily.
>>>>>>>> 
>>>>>>>> My MemoryListener is setup for &address_space_memory and I then filter
>>>>>>>> out anything that's not memory_region_is_ram().  This still gets
>>>>>>>> through, so how do I easily find out?
>>>>>>>> 
>>>>>>>>> But in this case it's vfio device itself that is sized so for sure you
>>>>>>>>> know it's MMIO.
>>>>>>>> 
>>>>>>>> How so?  I have a MemoryListener as described above and pass everything
>>>>>>>> through to the IOMMU.  I suppose I could look through all the
>>>>>>>> VFIODevices and check if the MemoryRegion matches, but that seems really
>>>>>>>> ugly.
>>>>>>>> 
>>>>>>>>> Maybe you will have same issue if there's another device with a 64 bit
>>>>>>>>> bar though, like ivshmem?
>>>>>>>> 
>>>>>>>> Perhaps, I suspect I'll see anything that registers their BAR
>>>>>>>> MemoryRegion from memory_region_init_ram or memory_region_init_ram_ptr.
>>>>>>> 
>>>>>>> Must be a 64 bit BAR to trigger the issue though.
>>>>>>> 
>>>>>>>>>>> Second, it enables peer-to-peer DMA between devices, which is something
>>>>>>>>>>> that we might be able to take advantage of with GPU passthrough.
>>>>>>>>>>> 
>>>>>>>>>>>>> Prior to this change, there was no re-map with the fffffffffebe0000
>>>>>>>>>>>>> address, presumably because it was beyond the address space of the PCI
>>>>>>>>>>>>> window.  This address is clearly not in a PCI MMIO space, so why are we
>>>>>>>>>>>>> allowing it to be realized in the system address space at this location?
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Alex
>>>>>>>>>>>> 
>>>>>>>>>>>> Why do you think it is not in PCI MMIO space?
>>>>>>>>>>>> True, CPU can't access this address but other pci devices can.
>>>>>>>>>>> 
>>>>>>>>>>> What happens on real hardware when an address like this is programmed to
>>>>>>>>>>> a device?  The CPU doesn't have the physical bits to access it.  I have
>>>>>>>>>>> serious doubts that another PCI device would be able to access it
>>>>>>>>>>> either.  Maybe in some limited scenario where the devices are on the
>>>>>>>>>>> same conventional PCI bus.  In the typical case, PCI addresses are
>>>>>>>>>>> always limited by some kind of aperture, whether that's explicit in
>>>>>>>>>>> bridge windows or implicit in hardware design (and perhaps made explicit
>>>>>>>>>>> in ACPI).  Even if I wanted to filter these out as noise in vfio, how
>>>>>>>>>>> would I do it in a way that still allows real 64bit MMIO to be
>>>>>>>>>>> programmed.  PCI has this knowledge, I hope.  VFIO doesn't.  Thanks,
>>>>>>>>>>> 
>>>>>>>>>>> Alex
>>>>>>>>> 
>>>>>>>>> AFAIK PCI doesn't have that knowledge as such. PCI spec is explicit that
>>>>>>>>> full 64 bit addresses must be allowed and hardware validation
>>>>>>>>> test suites normally check that it actually does work
>>>>>>>>> if it happens.
>>>>>>>> 
>>>>>>>> Sure, PCI devices themselves, but the chipset typically has defined
>>>>>>>> routing, that's more what I'm referring to.  There are generally only
>>>>>>>> fixed address windows for RAM vs MMIO.
>>>>>>> 
>>>>>>> The physical chipset? Likely - in the presence of IOMMU.
>>>>>>> Without that, devices can talk to each other without going
>>>>>>> through chipset, and bridge spec is very explicit that
>>>>>>> full 64 bit addressing must be supported.
>>>>>>> 
>>>>>>> So as long as we don't emulate an IOMMU,
>>>>>>> guest will normally think it's okay to use any address.
>>>>>>> 
>>>>>>>>> Yes, if there's a bridge somewhere on the path that bridge's
>>>>>>>>> windows would protect you, but pci already does this filtering:
>>>>>>>>> if you see this address in the memory map this means
>>>>>>>>> your virtual device is on root bus.
>>>>>>>>> 
>>>>>>>>> So I think it's the other way around: if VFIO requires specific
>>>>>>>>> address ranges to be assigned to devices, it should give this
>>>>>>>>> info to qemu and qemu can give this to guest.
>>>>>>>>> Then anything outside that range can be ignored by VFIO.
>>>>>>>> 
>>>>>>>> Then we get into deficiencies in the IOMMU API and maybe VFIO.  There's
>>>>>>>> currently no way to find out the address width of the IOMMU.  We've been
>>>>>>>> getting by because it's safely close enough to the CPU address width to
>>>>>>>> not be a concern until we start exposing things at the top of the 64bit
>>>>>>>> address space.  Maybe I can safely ignore anything above
>>>>>>>> TARGET_PHYS_ADDR_SPACE_BITS for now.  Thanks,
>>>>>>>> 
>>>>>>>> Alex
>>>>>>> 
>>>>>>> I think it's not related to target CPU at all - it's a host limitation.
>>>>>>> So just make up your own constant, maybe depending on host architecture.
>>>>>>> Long term add an ioctl to query it.
>>>>>> 
>>>>>> It's a hardware limitation which I'd imagine has some loose ties to the
>>>>>> physical address bits of the CPU.
>>>>>> 
>>>>>>> Also, we can add a fwcfg interface to tell bios that it should avoid
>>>>>>> placing BARs above some address.
>>>>>> 
>>>>>> That doesn't help this case, it's a spurious mapping caused by sizing
>>>>>> the BARs with them enabled.  We may still want such a thing to feed into
>>>>>> building ACPI tables though.
>>>>> 
>>>>> Well the point is that if you want BIOS to avoid
>>>>> specific addresses, you need to tell it what to avoid.
>>>>> But neither BIOS nor ACPI actually cover the range above
>>>>> 2^48 ATM so it's not a high priority.
>>>>> 
>>>>>>> Since it's a vfio limitation I think it should be a vfio API, along the
>>>>>>> lines of vfio_get_addr_space_bits(void).
>>>>>>> (Is this true btw? legacy assignment doesn't have this problem?)
>>>>>> 
>>>>>> It's an IOMMU hardware limitation, legacy assignment has the same
>>>>>> problem.  It looks like legacy will abort() in QEMU for the failed
>>>>>> mapping and I'm planning to tighten vfio to also kill the VM for failed
>>>>>> mappings.  In the short term, I think I'll ignore any mappings above
>>>>>> TARGET_PHYS_ADDR_SPACE_BITS,
>>>>> 
>>>>> That seems very wrong. It will still fail on an x86 host if we are
>>>>> emulating a CPU with full 64 bit addressing. The limitation is on the
>>>>> host side there's no real reason to tie it to the target.
>>> 
>>> I doubt vfio would be the only thing broken in that case.
>>> 
>>>>>> long term vfio already has an IOMMU info
>>>>>> ioctl that we could use to return this information, but we'll need to
>>>>>> figure out how to get it out of the IOMMU driver first.
>>>>>> Thanks,
>>>>>> 
>>>>>> Alex
>>>>> 
>>>>> Short term, just assume 48 bits on x86.
>>> 
>>> I hate to pick an arbitrary value since we have a very specific mapping
>>> we're trying to avoid.  Perhaps a better option is to skip anything
>>> where:
>>> 
>>>       MemoryRegionSection.offset_within_address_space >
>>>       ~MemoryRegionSection.offset_within_address_space
>>> 
>>>>> We need to figure out what's the limitation on ppc and arm -
>>>>> maybe there's none and it can address full 64 bit range.
>>>> 
>>>> IIUC on PPC and ARM you always have BAR windows where things can get mapped into. Unlike x86 where the full phyiscal address range can be overlayed by BARs.
>>>> 
>>>> Or did I misunderstand the question?
>>> 
>>> Sounds right, if either BAR mappings outside the window will not be
>>> realized in the memory space or the IOMMU has a full 64bit address
>>> space, there's no problem.  Here we have an intermediate step in the BAR
>>> sizing producing a stray mapping that the IOMMU hardware can't handle.
>>> Even if we could handle it, it's not clear that we want to.  On AMD-Vi
>>> the IOMMU pages tables can grow to 6-levels deep.  A stray mapping like
>>> this then causes space and time overhead until the tables are pruned
>>> back down.  Thanks,
>> 
>> I thought sizing is hard defined as a set to
>> -1? Can't we check for that one special case and treat it as "not mapped, but tell the guest the size in config space"?
>> 
>> Alex
> 
> We already have a work-around like this and it works for 32 bit BARs
> or after software writes the full 64 register:
>    if (last_addr <= new_addr || new_addr == 0 ||
>        last_addr == PCI_BAR_UNMAPPED) {
>        return PCI_BAR_UNMAPPED;
>    }
> 
>    if  (!(type & PCI_BASE_ADDRESS_MEM_TYPE_64) && last_addr >= UINT32_MAX) {
>        return PCI_BAR_UNMAPPED;
>    }
> 
> 
> But for 64 bit BARs this software writes all 1's
> in the high 32 bit register before writing in the low register
> (see trace above).
> This makes it impossible to distinguish between
> setting bar at fffffffffebe0000 and this intermediate sizing step.

Well, at least according to the AMD manual there's only support for 52 bits of physical address space:

	• Long Mode—This mode is unique to the AMD64 architecture. This mode supports up to 4 petabytes of physical-address space using 52-bit physical addresses.

Intel seems to agree:

	• CPUID.80000008H:EAX[7:0] reports the physical-address width supported by the processor. (For processors that do not support CPUID function 80000008H, the width is generally 36 if CPUID.01H:EDX.PAE [bit 6] = 1 and 32 otherwise.) This width is referred to as MAXPHYADDR. MAXPHYADDR is at most 52.

Of course there's potential for future extensions to allow more bits, but at least the current generation x86_64 (and x86) specification clearly only supports 52 bits of physical address space. And non-x86(_64) targets don't care about bigger address spaces either, because they use BAR windows which are very unlikely to grow bigger than 52 bits ;).
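
A minimal sketch of querying that limit on an x86 host, using GCC's
__get_cpuid from cpuid.h (purely illustrative, this is not something
vfio or QEMU does today):

    #include <cpuid.h>
    #include <stdio.h>

    int main(void)
    {
        unsigned int eax, ebx, ecx, edx;

        /* CPUID leaf 0x80000008: EAX[7:0] = MAXPHYADDR (physical bits) */
        if (__get_cpuid(0x80000008, &eax, &ebx, &ecx, &edx)) {
            printf("host physical address bits: %u\n", eax & 0xff);
        } else {
            printf("leaf 0x80000008 not supported (assume 36 or 32 bits)\n");
        }
        return 0;
    }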


Alex
Peter Maydell Jan. 14, 2014, 9:31 a.m. UTC | #16
On 14 January 2014 09:20, Alexander Graf <agraf@suse.de> wrote:
> Of course there's potential for future extensions to allow for more
> bits in the future, but at least the current generation x86_64 (and x86)
> specification clearly only supports 52 bits of physical address space.
> And non-x86(_64) don't care about bigger address spaces either
> because they use BAR windows which are very unlikely to grow
> bigger than 52 bits ;).

There's no reason you couldn't do an ARM (most likely AArch64)
system which dealt with PCI BARs the same way as x86 rather
than having a fixed window in the memory map; I wouldn't be
surprised if some of the server designs took that route. However
the architecture specifies a 48 bit maximum physical address.

With some of the BAR-window design PCI controllers I think it's
theoretically possible to configure the controller so that the
window shows the very top part of PCI address space and then
configure all your device BARs with very high PCI addresses.
In that case the BAR MemoryRegions would get mapped in
at very high addresses in the PCI memory address space
MemoryRegion container, and at more usual small addresses
in the system AddressSpace.

thanks
-- PMM
Avi Kivity Jan. 14, 2014, 10:24 a.m. UTC | #17
On 01/14/2014 12:48 AM, Alex Williamson wrote:
> On Mon, 2014-01-13 at 22:48 +0100, Alexander Graf wrote:
>>> Am 13.01.2014 um 22:39 schrieb Alex Williamson <alex.williamson@redhat.com>:
>>>
>>>> On Sun, 2014-01-12 at 16:03 +0100, Alexander Graf wrote:
>>>>> On 12.01.2014, at 08:54, Michael S. Tsirkin <mst@redhat.com> wrote:
>>>>>
>>>>>> On Fri, Jan 10, 2014 at 08:31:36AM -0700, Alex Williamson wrote:
>>>>>>> On Fri, 2014-01-10 at 14:55 +0200, Michael S. Tsirkin wrote:
>>>>>>>> On Thu, Jan 09, 2014 at 03:42:22PM -0700, Alex Williamson wrote:
>>>>>>>>> On Thu, 2014-01-09 at 23:56 +0200, Michael S. Tsirkin wrote:
>>>>>>>>>> On Thu, Jan 09, 2014 at 12:03:26PM -0700, Alex Williamson wrote:
>>>>>>>>>>> On Thu, 2014-01-09 at 11:47 -0700, Alex Williamson wrote:
>>>>>>>>>>>> On Thu, 2014-01-09 at 20:00 +0200, Michael S. Tsirkin wrote:
>>>>>>>>>>>>> On Thu, Jan 09, 2014 at 10:24:47AM -0700, Alex Williamson wrote:
>>>>>>>>>>>>>> On Wed, 2013-12-11 at 20:30 +0200, Michael S. Tsirkin wrote:
>>>>>>>>>>>>>> From: Paolo Bonzini <pbonzini@redhat.com>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> As an alternative to commit 818f86b (exec: limit system memory
>>>>>>>>>>>>>> size, 2013-11-04) let's just make all address spaces 64-bit wide.
>>>>>>>>>>>>>> This eliminates problems with phys_page_find ignoring bits above
>>>>>>>>>>>>>> TARGET_PHYS_ADDR_SPACE_BITS and address_space_translate_internal
>>>>>>>>>>>>>> consequently messing up the computations.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> In Luiz's reported crash, at startup gdb attempts to read from address
>>>>>>>>>>>>>> 0xffffffffffffffe6 to 0xffffffffffffffff inclusive.  The region it gets
>>>>>>>>>>>>>> is the newly introduced master abort region, which is as big as the PCI
>>>>>>>>>>>>>> address space (see pci_bus_init).  Due to a typo that's only 2^63-1,
>>>>>>>>>>>>>> not 2^64.  But we get it anyway because phys_page_find ignores the upper
>>>>>>>>>>>>>> bits of the physical address.  In address_space_translate_internal then
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>    diff = int128_sub(section->mr->size, int128_make64(addr));
>>>>>>>>>>>>>>    *plen = int128_get64(int128_min(diff, int128_make64(*plen)));
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> diff becomes negative, and int128_get64 booms.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The size of the PCI address space region should be fixed anyway.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Reported-by: Luiz Capitulino <lcapitulino@redhat.com>
>>>>>>>>>>>>>> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
>>>>>>>>>>>>>> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
>>>>>>>>>>>>>> ---
>>>>>>>>>>>>>> exec.c | 8 ++------
>>>>>>>>>>>>>> 1 file changed, 2 insertions(+), 6 deletions(-)
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> diff --git a/exec.c b/exec.c
>>>>>>>>>>>>>> index 7e5ce93..f907f5f 100644
>>>>>>>>>>>>>> --- a/exec.c
>>>>>>>>>>>>>> +++ b/exec.c
>>>>>>>>>>>>>> @@ -94,7 +94,7 @@ struct PhysPageEntry {
>>>>>>>>>>>>>> #define PHYS_MAP_NODE_NIL (((uint32_t)~0) >> 6)
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> /* Size of the L2 (and L3, etc) page tables.  */
>>>>>>>>>>>>>> -#define ADDR_SPACE_BITS TARGET_PHYS_ADDR_SPACE_BITS
>>>>>>>>>>>>>> +#define ADDR_SPACE_BITS 64
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> #define P_L2_BITS 10
>>>>>>>>>>>>>> #define P_L2_SIZE (1 << P_L2_BITS)
>>>>>>>>>>>>>> @@ -1861,11 +1861,7 @@ static void memory_map_init(void)
>>>>>>>>>>>>>> {
>>>>>>>>>>>>>>     system_memory = g_malloc(sizeof(*system_memory));
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> -    assert(ADDR_SPACE_BITS <= 64);
>>>>>>>>>>>>>> -
>>>>>>>>>>>>>> -    memory_region_init(system_memory, NULL, "system",
>>>>>>>>>>>>>> -                       ADDR_SPACE_BITS == 64 ?
>>>>>>>>>>>>>> -                       UINT64_MAX : (0x1ULL << ADDR_SPACE_BITS));
>>>>>>>>>>>>>> +    memory_region_init(system_memory, NULL, "system", UINT64_MAX);
>>>>>>>>>>>>>>     address_space_init(&address_space_memory, system_memory, "memory");
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>     system_io = g_malloc(sizeof(*system_io));
>>>>>>>>>>>>> This seems to have some unexpected consequences around sizing 64bit PCI
>>>>>>>>>>>>> BARs that I'm not sure how to handle.
>>>>>>>>>>>> BARs are often disabled during sizing. Maybe you
>>>>>>>>>>>> don't detect BAR being disabled?
>>>>>>>>>>> See the trace below, the BARs are not disabled.  QEMU pci-core is doing
>>>>>>>>>>> the sizing an memory region updates for the BARs, vfio is just a
>>>>>>>>>>> pass-through here.
>>>>>>>>>> Sorry, not in the trace below, but yes the sizing seems to be happening
>>>>>>>>>> while I/O & memory are enabled int he command register.  Thanks,
>>>>>>>>>>
>>>>>>>>>> Alex
>>>>>>>>> OK then from QEMU POV this BAR value is not special at all.
>>>>>>>> Unfortunately
>>>>>>>>
>>>>>>>>>>>>> After this patch I get vfio
>>>>>>>>>>>>> traces like this:
>>>>>>>>>>>>>
>>>>>>>>>>>>> vfio: vfio_pci_read_config(0000:01:10.0, @0x10, len=0x4) febe0004
>>>>>>>>>>>>> (save lower 32bits of BAR)
>>>>>>>>>>>>> vfio: vfio_pci_write_config(0000:01:10.0, @0x10, 0xffffffff, len=0x4)
>>>>>>>>>>>>> (write mask to BAR)
>>>>>>>>>>>>> vfio: region_del febe0000 - febe3fff
>>>>>>>>>>>>> (memory region gets unmapped)
>>>>>>>>>>>>> vfio: vfio_pci_read_config(0000:01:10.0, @0x10, len=0x4) ffffc004
>>>>>>>>>>>>> (read size mask)
>>>>>>>>>>>>> vfio: vfio_pci_write_config(0000:01:10.0, @0x10, 0xfebe0004, len=0x4)
>>>>>>>>>>>>> (restore BAR)
>>>>>>>>>>>>> vfio: region_add febe0000 - febe3fff [0x7fcf3654d000]
>>>>>>>>>>>>> (memory region re-mapped)
>>>>>>>>>>>>> vfio: vfio_pci_read_config(0000:01:10.0, @0x14, len=0x4) 0
>>>>>>>>>>>>> (save upper 32bits of BAR)
>>>>>>>>>>>>> vfio: vfio_pci_write_config(0000:01:10.0, @0x14, 0xffffffff, len=0x4)
>>>>>>>>>>>>> (write mask to BAR)
>>>>>>>>>>>>> vfio: region_del febe0000 - febe3fff
>>>>>>>>>>>>> (memory region gets unmapped)
>>>>>>>>>>>>> vfio: region_add fffffffffebe0000 - fffffffffebe3fff [0x7fcf3654d000]
>>>>>>>>>>>>> (memory region gets re-mapped with new address)
>>>>>>>>>>>>> qemu-system-x86_64: vfio_dma_map(0x7fcf38861710, 0xfffffffffebe0000, 0x4000, 0x7fcf3654d000) = -14 (Bad address)
>>>>>>>>>>>>> (iommu barfs because it can only handle 48bit physical addresses)
>>>>>>>>>>>> Why are you trying to program BAR addresses for dma in the iommu?
>>>>>>>>>>> Two reasons, first I can't tell the difference between RAM and MMIO.
>>>>>>>>> Why can't you? Generally memory core let you find out easily.
>>>>>>>> My MemoryListener is setup for &address_space_memory and I then filter
>>>>>>>> out anything that's not memory_region_is_ram().  This still gets
>>>>>>>> through, so how do I easily find out?
>>>>>>>>
>>>>>>>>> But in this case it's vfio device itself that is sized so for sure you
>>>>>>>>> know it's MMIO.
>>>>>>>> How so?  I have a MemoryListener as described above and pass everything
>>>>>>>> through to the IOMMU.  I suppose I could look through all the
>>>>>>>> VFIODevices and check if the MemoryRegion matches, but that seems really
>>>>>>>> ugly.
>>>>>>>>
>>>>>>>>> Maybe you will have same issue if there's another device with a 64 bit
>>>>>>>>> bar though, like ivshmem?
>>>>>>>> Perhaps, I suspect I'll see anything that registers their BAR
>>>>>>>> MemoryRegion from memory_region_init_ram or memory_region_init_ram_ptr.
>>>>>>> Must be a 64 bit BAR to trigger the issue though.
>>>>>>>
>>>>>>>>>>> Second, it enables peer-to-peer DMA between devices, which is something
>>>>>>>>>>> that we might be able to take advantage of with GPU passthrough.
>>>>>>>>>>>
>>>>>>>>>>>>> Prior to this change, there was no re-map with the fffffffffebe0000
>>>>>>>>>>>>> address, presumably because it was beyond the address space of the PCI
>>>>>>>>>>>>> window.  This address is clearly not in a PCI MMIO space, so why are we
>>>>>>>>>>>>> allowing it to be realized in the system address space at this location?
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Alex
>>>>>>>>>>>> Why do you think it is not in PCI MMIO space?
>>>>>>>>>>>> True, CPU can't access this address but other pci devices can.
>>>>>>>>>>> What happens on real hardware when an address like this is programmed to
>>>>>>>>>>> a device?  The CPU doesn't have the physical bits to access it.  I have
>>>>>>>>>>> serious doubts that another PCI device would be able to access it
>>>>>>>>>>> either.  Maybe in some limited scenario where the devices are on the
>>>>>>>>>>> same conventional PCI bus.  In the typical case, PCI addresses are
>>>>>>>>>>> always limited by some kind of aperture, whether that's explicit in
>>>>>>>>>>> bridge windows or implicit in hardware design (and perhaps made explicit
>>>>>>>>>>> in ACPI).  Even if I wanted to filter these out as noise in vfio, how
>>>>>>>>>>> would I do it in a way that still allows real 64bit MMIO to be
>>>>>>>>>>> programmed.  PCI has this knowledge, I hope.  VFIO doesn't.  Thanks,
>>>>>>>>>>>
>>>>>>>>>>> Alex
>>>>>>>>> AFAIK PCI doesn't have that knowledge as such. PCI spec is explicit that
>>>>>>>>> full 64 bit addresses must be allowed and hardware validation
>>>>>>>>> test suites normally check that it actually does work
>>>>>>>>> if it happens.
>>>>>>>> Sure, PCI devices themselves, but the chipset typically has defined
>>>>>>>> routing, that's more what I'm referring to.  There are generally only
>>>>>>>> fixed address windows for RAM vs MMIO.
>>>>>>> The physical chipset? Likely - in the presence of IOMMU.
>>>>>>> Without that, devices can talk to each other without going
>>>>>>> through chipset, and bridge spec is very explicit that
>>>>>>> full 64 bit addressing must be supported.
>>>>>>>
>>>>>>> So as long as we don't emulate an IOMMU,
>>>>>>> guest will normally think it's okay to use any address.
>>>>>>>
>>>>>>>>> Yes, if there's a bridge somewhere on the path that bridge's
>>>>>>>>> windows would protect you, but pci already does this filtering:
>>>>>>>>> if you see this address in the memory map this means
>>>>>>>>> your virtual device is on root bus.
>>>>>>>>>
>>>>>>>>> So I think it's the other way around: if VFIO requires specific
>>>>>>>>> address ranges to be assigned to devices, it should give this
>>>>>>>>> info to qemu and qemu can give this to guest.
>>>>>>>>> Then anything outside that range can be ignored by VFIO.
>>>>>>>> Then we get into deficiencies in the IOMMU API and maybe VFIO.  There's
>>>>>>>> currently no way to find out the address width of the IOMMU.  We've been
>>>>>>>> getting by because it's safely close enough to the CPU address width to
>>>>>>>> not be a concern until we start exposing things at the top of the 64bit
>>>>>>>> address space.  Maybe I can safely ignore anything above
>>>>>>>> TARGET_PHYS_ADDR_SPACE_BITS for now.  Thanks,
>>>>>>>>
>>>>>>>> Alex
>>>>>>> I think it's not related to target CPU at all - it's a host limitation.
>>>>>>> So just make up your own constant, maybe depending on host architecture.
>>>>>>> Long term add an ioctl to query it.
>>>>>> It's a hardware limitation which I'd imagine has some loose ties to the
>>>>>> physical address bits of the CPU.
>>>>>>
>>>>>>> Also, we can add a fwcfg interface to tell bios that it should avoid
>>>>>>> placing BARs above some address.
>>>>>> That doesn't help this case, it's a spurious mapping caused by sizing
>>>>>> the BARs with them enabled.  We may still want such a thing to feed into
>>>>>> building ACPI tables though.
>>>>> Well the point is that if you want BIOS to avoid
>>>>> specific addresses, you need to tell it what to avoid.
>>>>> But neither BIOS nor ACPI actually cover the range above
>>>>> 2^48 ATM so it's not a high priority.
>>>>>
>>>>>>> Since it's a vfio limitation I think it should be a vfio API, along the
>>>>>>> lines of vfio_get_addr_space_bits(void).
>>>>>>> (Is this true btw? legacy assignment doesn't have this problem?)
>>>>>> It's an IOMMU hardware limitation, legacy assignment has the same
>>>>>> problem.  It looks like legacy will abort() in QEMU for the failed
>>>>>> mapping and I'm planning to tighten vfio to also kill the VM for failed
>>>>>> mappings.  In the short term, I think I'll ignore any mappings above
>>>>>> TARGET_PHYS_ADDR_SPACE_BITS,
>>>>> That seems very wrong. It will still fail on an x86 host if we are
>>>>> emulating a CPU with full 64 bit addressing. The limitation is on the
>>>>> host side there's no real reason to tie it to the target.
>>> I doubt vfio would be the only thing broken in that case.
>>>
>>>>>> long term vfio already has an IOMMU info
>>>>>> ioctl that we could use to return this information, but we'll need to
>>>>>> figure out how to get it out of the IOMMU driver first.
>>>>>> Thanks,
>>>>>>
>>>>>> Alex
>>>>> Short term, just assume 48 bits on x86.
>>> I hate to pick an arbitrary value since we have a very specific mapping
>>> we're trying to avoid.  Perhaps a better option is to skip anything
>>> where:
>>>
>>>         MemoryRegionSection.offset_within_address_space >
>>>         ~MemoryRegionSection.offset_within_address_space
>>>
>>>>> We need to figure out what's the limitation on ppc and arm -
>>>>> maybe there's none and it can address full 64 bit range.
>>>> IIUC on PPC and ARM you always have BAR windows where things can get mapped into. Unlike x86 where the full phyiscal address range can be overlayed by BARs.
>>>>
>>>> Or did I misunderstand the question?
>>> Sounds right, if either BAR mappings outside the window will not be
>>> realized in the memory space or the IOMMU has a full 64bit address
>>> space, there's no problem.  Here we have an intermediate step in the BAR
>>> sizing producing a stray mapping that the IOMMU hardware can't handle.
>>> Even if we could handle it, it's not clear that we want to.  On AMD-Vi
>>> the IOMMU pages tables can grow to 6-levels deep.  A stray mapping like
>>> this then causes space and time overhead until the tables are pruned
>>> back down.  Thanks,
>> I thought sizing is hard-defined as setting the BAR to -1?
>> Can't we check for that one special case and treat it as "not mapped, but tell the guest the size in config space"?
> PCI doesn't want to handle this as anything special to differentiate a
> sizing mask from a valid BAR address.  I agree though, I'd prefer to
> never see a spurious address like this in my MemoryListener.
>
>

Can't you just ignore regions that cannot be mapped?  Oh, and teach the 
bios and/or linux to disable memory access while sizing.
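For illustration, a rough sketch of what "disable memory access while sizing" would look like for a single 32-bit memory BAR, using the standard Linux config-space accessors. This is not what the stock kernel does; as Michael notes further down, Linux deliberately leaves decode enabled because some devices cannot have it re-enabled afterwards.

    #include <linux/pci.h>

    /* Sketch only: size one 32-bit memory BAR with memory decode disabled
     * around the sizing write.  64-bit BARs and error handling are omitted. */
    static u32 size_bar_with_decode_disabled(struct pci_dev *pdev, int bar)
    {
        int reg = PCI_BASE_ADDRESS_0 + bar * 4;
        u16 orig_cmd;
        u32 saved, mask;

        pci_read_config_word(pdev, PCI_COMMAND, &orig_cmd);
        pci_write_config_word(pdev, PCI_COMMAND,
                              orig_cmd & ~PCI_COMMAND_MEMORY);

        pci_read_config_dword(pdev, reg, &saved);       /* save current value */
        pci_write_config_dword(pdev, reg, 0xffffffff);  /* write sizing mask   */
        pci_read_config_dword(pdev, reg, &mask);        /* read back size bits */
        pci_write_config_dword(pdev, reg, saved);       /* restore the BAR     */

        pci_write_config_word(pdev, PCI_COMMAND, orig_cmd);

        mask &= PCI_BASE_ADDRESS_MEM_MASK;              /* drop the flag bits  */
        return ~mask + 1;                               /* decoded BAR size    */
    }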
Michael S. Tsirkin Jan. 14, 2014, 10:28 a.m. UTC | #18
On Tue, Jan 14, 2014 at 10:20:57AM +0100, Alexander Graf wrote:
> 
> On 14.01.2014, at 09:18, Michael S. Tsirkin <mst@redhat.com> wrote:
> 
> > On Mon, Jan 13, 2014 at 10:48:21PM +0100, Alexander Graf wrote:
> >> 
> >> 
> >>> Am 13.01.2014 um 22:39 schrieb Alex Williamson <alex.williamson@redhat.com>:
> >>> 
> >>>> On Sun, 2014-01-12 at 16:03 +0100, Alexander Graf wrote:
> >>>>> On 12.01.2014, at 08:54, Michael S. Tsirkin <mst@redhat.com> wrote:
> >>>>> 
> >>>>>> On Fri, Jan 10, 2014 at 08:31:36AM -0700, Alex Williamson wrote:
> >>>>>>> On Fri, 2014-01-10 at 14:55 +0200, Michael S. Tsirkin wrote:
> >>>>>>>> On Thu, Jan 09, 2014 at 03:42:22PM -0700, Alex Williamson wrote:
> >>>>>>>>> On Thu, 2014-01-09 at 23:56 +0200, Michael S. Tsirkin wrote:
> >>>>>>>>>> On Thu, Jan 09, 2014 at 12:03:26PM -0700, Alex Williamson wrote:
> >>>>>>>>>>> On Thu, 2014-01-09 at 11:47 -0700, Alex Williamson wrote:
> >>>>>>>>>>>> On Thu, 2014-01-09 at 20:00 +0200, Michael S. Tsirkin wrote:
> >>>>>>>>>>>>> On Thu, Jan 09, 2014 at 10:24:47AM -0700, Alex Williamson wrote:
> >>>>>>>>>>>>>> On Wed, 2013-12-11 at 20:30 +0200, Michael S. Tsirkin wrote:
> >>>>>>>>>>>>>> From: Paolo Bonzini <pbonzini@redhat.com>
> >>>>>>>>>>>>>> 
> >>>>>>>>>>>>>> As an alternative to commit 818f86b (exec: limit system memory
> >>>>>>>>>>>>>> size, 2013-11-04) let's just make all address spaces 64-bit wide.
> >>>>>>>>>>>>>> This eliminates problems with phys_page_find ignoring bits above
> >>>>>>>>>>>>>> TARGET_PHYS_ADDR_SPACE_BITS and address_space_translate_internal
> >>>>>>>>>>>>>> consequently messing up the computations.
> >>>>>>>>>>>>>> 
> >>>>>>>>>>>>>> In Luiz's reported crash, at startup gdb attempts to read from address
> >>>>>>>>>>>>>> 0xffffffffffffffe6 to 0xffffffffffffffff inclusive.  The region it gets
> >>>>>>>>>>>>>> is the newly introduced master abort region, which is as big as the PCI
> >>>>>>>>>>>>>> address space (see pci_bus_init).  Due to a typo that's only 2^63-1,
> >>>>>>>>>>>>>> not 2^64.  But we get it anyway because phys_page_find ignores the upper
> >>>>>>>>>>>>>> bits of the physical address.  In address_space_translate_internal then
> >>>>>>>>>>>>>> 
> >>>>>>>>>>>>>>  diff = int128_sub(section->mr->size, int128_make64(addr));
> >>>>>>>>>>>>>>  *plen = int128_get64(int128_min(diff, int128_make64(*plen)));
> >>>>>>>>>>>>>> 
> >>>>>>>>>>>>>> diff becomes negative, and int128_get64 booms.
> >>>>>>>>>>>>>> 
> >>>>>>>>>>>>>> The size of the PCI address space region should be fixed anyway.
> >>>>>>>>>>>>>> 
> >>>>>>>>>>>>>> Reported-by: Luiz Capitulino <lcapitulino@redhat.com>
> >>>>>>>>>>>>>> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> >>>>>>>>>>>>>> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> >>>>>>>>>>>>>> ---
> >>>>>>>>>>>>>> exec.c | 8 ++------
> >>>>>>>>>>>>>> 1 file changed, 2 insertions(+), 6 deletions(-)
> >>>>>>>>>>>>>> 
> >>>>>>>>>>>>>> diff --git a/exec.c b/exec.c
> >>>>>>>>>>>>>> index 7e5ce93..f907f5f 100644
> >>>>>>>>>>>>>> --- a/exec.c
> >>>>>>>>>>>>>> +++ b/exec.c
> >>>>>>>>>>>>>> @@ -94,7 +94,7 @@ struct PhysPageEntry {
> >>>>>>>>>>>>>> #define PHYS_MAP_NODE_NIL (((uint32_t)~0) >> 6)
> >>>>>>>>>>>>>> 
> >>>>>>>>>>>>>> /* Size of the L2 (and L3, etc) page tables.  */
> >>>>>>>>>>>>>> -#define ADDR_SPACE_BITS TARGET_PHYS_ADDR_SPACE_BITS
> >>>>>>>>>>>>>> +#define ADDR_SPACE_BITS 64
> >>>>>>>>>>>>>> 
> >>>>>>>>>>>>>> #define P_L2_BITS 10
> >>>>>>>>>>>>>> #define P_L2_SIZE (1 << P_L2_BITS)
> >>>>>>>>>>>>>> @@ -1861,11 +1861,7 @@ static void memory_map_init(void)
> >>>>>>>>>>>>>> {
> >>>>>>>>>>>>>>   system_memory = g_malloc(sizeof(*system_memory));
> >>>>>>>>>>>>>> 
> >>>>>>>>>>>>>> -    assert(ADDR_SPACE_BITS <= 64);
> >>>>>>>>>>>>>> -
> >>>>>>>>>>>>>> -    memory_region_init(system_memory, NULL, "system",
> >>>>>>>>>>>>>> -                       ADDR_SPACE_BITS == 64 ?
> >>>>>>>>>>>>>> -                       UINT64_MAX : (0x1ULL << ADDR_SPACE_BITS));
> >>>>>>>>>>>>>> +    memory_region_init(system_memory, NULL, "system", UINT64_MAX);
> >>>>>>>>>>>>>>   address_space_init(&address_space_memory, system_memory, "memory");
> >>>>>>>>>>>>>> 
> >>>>>>>>>>>>>>   system_io = g_malloc(sizeof(*system_io));
> >>>>>>>>>>>>> 
> >>>>>>>>>>>>> This seems to have some unexpected consequences around sizing 64bit PCI
> >>>>>>>>>>>>> BARs that I'm not sure how to handle.
> >>>>>>>>>>>> 
> >>>>>>>>>>>> BARs are often disabled during sizing. Maybe you
> >>>>>>>>>>>> don't detect BAR being disabled?
> >>>>>>>>>>> 
> >>>>>>>>>>> See the trace below, the BARs are not disabled.  QEMU pci-core is doing
> >>>>>>>>>>> the sizing and memory region updates for the BARs, vfio is just a
> >>>>>>>>>>> pass-through here.
> >>>>>>>>>> 
> >>>>>>>>>> Sorry, not in the trace below, but yes the sizing seems to be happening
> >>>>>>>>>> while I/O & memory are enabled in the command register.  Thanks,
> >>>>>>>>>> 
> >>>>>>>>>> Alex
> >>>>>>>>> 
> >>>>>>>>> OK then from QEMU POV this BAR value is not special at all.
> >>>>>>>> 
> >>>>>>>> Unfortunately
> >>>>>>>> 
> >>>>>>>>>>>>> After this patch I get vfio
> >>>>>>>>>>>>> traces like this:
> >>>>>>>>>>>>> 
> >>>>>>>>>>>>> vfio: vfio_pci_read_config(0000:01:10.0, @0x10, len=0x4) febe0004
> >>>>>>>>>>>>> (save lower 32bits of BAR)
> >>>>>>>>>>>>> vfio: vfio_pci_write_config(0000:01:10.0, @0x10, 0xffffffff, len=0x4)
> >>>>>>>>>>>>> (write mask to BAR)
> >>>>>>>>>>>>> vfio: region_del febe0000 - febe3fff
> >>>>>>>>>>>>> (memory region gets unmapped)
> >>>>>>>>>>>>> vfio: vfio_pci_read_config(0000:01:10.0, @0x10, len=0x4) ffffc004
> >>>>>>>>>>>>> (read size mask)
> >>>>>>>>>>>>> vfio: vfio_pci_write_config(0000:01:10.0, @0x10, 0xfebe0004, len=0x4)
> >>>>>>>>>>>>> (restore BAR)
> >>>>>>>>>>>>> vfio: region_add febe0000 - febe3fff [0x7fcf3654d000]
> >>>>>>>>>>>>> (memory region re-mapped)
> >>>>>>>>>>>>> vfio: vfio_pci_read_config(0000:01:10.0, @0x14, len=0x4) 0
> >>>>>>>>>>>>> (save upper 32bits of BAR)
> >>>>>>>>>>>>> vfio: vfio_pci_write_config(0000:01:10.0, @0x14, 0xffffffff, len=0x4)
> >>>>>>>>>>>>> (write mask to BAR)
> >>>>>>>>>>>>> vfio: region_del febe0000 - febe3fff
> >>>>>>>>>>>>> (memory region gets unmapped)
> >>>>>>>>>>>>> vfio: region_add fffffffffebe0000 - fffffffffebe3fff [0x7fcf3654d000]
> >>>>>>>>>>>>> (memory region gets re-mapped with new address)
> >>>>>>>>>>>>> qemu-system-x86_64: vfio_dma_map(0x7fcf38861710, 0xfffffffffebe0000, 0x4000, 0x7fcf3654d000) = -14 (Bad address)
> >>>>>>>>>>>>> (iommu barfs because it can only handle 48bit physical addresses)
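As an aside, the transient fffffffffebe0000 in this trace is just the two BAR halves combined at the point where the guest has restored the low dword but still has the all-ones sizing mask in the high dword. A small self-contained illustration, with the values taken from the trace above:

    #include <inttypes.h>
    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        uint32_t lo = 0xfebe0004;   /* low dword, already restored             */
        uint32_t hi = 0xffffffff;   /* high dword, still holding the size mask */

        /* Mask off the low 4 flag bits and glue the halves together the way
         * the 64-bit BAR is decoded. */
        uint64_t addr = ((uint64_t)hi << 32) | (lo & ~0xfu);

        printf("%016" PRIx64 "\n", addr);   /* prints fffffffffebe0000 */
        return 0;
    }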
> >>>>>>>>>>>> 
> >>>>>>>>>>>> Why are you trying to program BAR addresses for dma in the iommu?
> >>>>>>>>>>> 
> >>>>>>>>>>> Two reasons, first I can't tell the difference between RAM and MMIO.
> >>>>>>>>> 
> >>>>>>>>> Why can't you? Generally memory core let you find out easily.
> >>>>>>>> 
> >>>>>>>> My MemoryListener is setup for &address_space_memory and I then filter
> >>>>>>>> out anything that's not memory_region_is_ram().  This still gets
> >>>>>>>> through, so how do I easily find out?
> >>>>>>>> 
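Roughly the kind of filter being described here, as a sketch against QEMU's memory API rather than the actual vfio listener code: skip anything that is not RAM-backed, which is exactly why a BAR registered via memory_region_init_ram_ptr() still gets through.

    #include "exec/memory.h"

    /* Sketch: hand only RAM-backed sections to the IOMMU.  A vfio BAR that
     * is mmap()ed into the guest is RAM-backed from the memory core's point
     * of view, so this test does not filter it out. */
    static void sketch_region_add(MemoryListener *listener,
                                  MemoryRegionSection *section)
    {
        if (!memory_region_is_ram(section->mr)) {
            return;   /* emulated MMIO: nothing for the hardware IOMMU */
        }

        /* ... map section->offset_within_address_space here ... */
    }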
> >>>>>>>>> But in this case it's vfio device itself that is sized so for sure you
> >>>>>>>>> know it's MMIO.
> >>>>>>>> 
> >>>>>>>> How so?  I have a MemoryListener as described above and pass everything
> >>>>>>>> through to the IOMMU.  I suppose I could look through all the
> >>>>>>>> VFIODevices and check if the MemoryRegion matches, but that seems really
> >>>>>>>> ugly.
> >>>>>>>> 
> >>>>>>>>> Maybe you will have same issue if there's another device with a 64 bit
> >>>>>>>>> bar though, like ivshmem?
> >>>>>>>> 
> >>>>>>>> Perhaps, I suspect I'll see anything that registers their BAR
> >>>>>>>> MemoryRegion from memory_region_init_ram or memory_region_init_ram_ptr.
> >>>>>>> 
> >>>>>>> Must be a 64 bit BAR to trigger the issue though.
> >>>>>>> 
> >>>>>>>>>>> Second, it enables peer-to-peer DMA between devices, which is something
> >>>>>>>>>>> that we might be able to take advantage of with GPU passthrough.
> >>>>>>>>>>> 
> >>>>>>>>>>>>> Prior to this change, there was no re-map with the fffffffffebe0000
> >>>>>>>>>>>>> address, presumably because it was beyond the address space of the PCI
> >>>>>>>>>>>>> window.  This address is clearly not in a PCI MMIO space, so why are we
> >>>>>>>>>>>>> allowing it to be realized in the system address space at this location?
> >>>>>>>>>>>>> Thanks,
> >>>>>>>>>>>>> 
> >>>>>>>>>>>>> Alex
> >>>>>>>>>>>> 
> >>>>>>>>>>>> Why do you think it is not in PCI MMIO space?
> >>>>>>>>>>>> True, CPU can't access this address but other pci devices can.
> >>>>>>>>>>> 
> >>>>>>>>>>> What happens on real hardware when an address like this is programmed to
> >>>>>>>>>>> a device?  The CPU doesn't have the physical bits to access it.  I have
> >>>>>>>>>>> serious doubts that another PCI device would be able to access it
> >>>>>>>>>>> either.  Maybe in some limited scenario where the devices are on the
> >>>>>>>>>>> same conventional PCI bus.  In the typical case, PCI addresses are
> >>>>>>>>>>> always limited by some kind of aperture, whether that's explicit in
> >>>>>>>>>>> bridge windows or implicit in hardware design (and perhaps made explicit
> >>>>>>>>>>> in ACPI).  Even if I wanted to filter these out as noise in vfio, how
> >>>>>>>>>>> would I do it in a way that still allows real 64bit MMIO to be
> >>>>>>>>>>> programmed.  PCI has this knowledge, I hope.  VFIO doesn't.  Thanks,
> >>>>>>>>>>> 
> >>>>>>>>>>> Alex
> >>>>>>>>> 
> >>>>>>>>> AFAIK PCI doesn't have that knowledge as such. PCI spec is explicit that
> >>>>>>>>> full 64 bit addresses must be allowed and hardware validation
> >>>>>>>>> test suites normally check that it actually does work
> >>>>>>>>> if it happens.
> >>>>>>>> 
> >>>>>>>> Sure, PCI devices themselves, but the chipset typically has defined
> >>>>>>>> routing, that's more what I'm referring to.  There are generally only
> >>>>>>>> fixed address windows for RAM vs MMIO.
> >>>>>>> 
> >>>>>>> The physical chipset? Likely - in the presence of IOMMU.
> >>>>>>> Without that, devices can talk to each other without going
> >>>>>>> through chipset, and bridge spec is very explicit that
> >>>>>>> full 64 bit addressing must be supported.
> >>>>>>> 
> >>>>>>> So as long as we don't emulate an IOMMU,
> >>>>>>> guest will normally think it's okay to use any address.
> >>>>>>> 
> >>>>>>>>> Yes, if there's a bridge somewhere on the path that bridge's
> >>>>>>>>> windows would protect you, but pci already does this filtering:
> >>>>>>>>> if you see this address in the memory map this means
> >>>>>>>>> your virtual device is on root bus.
> >>>>>>>>> 
> >>>>>>>>> So I think it's the other way around: if VFIO requires specific
> >>>>>>>>> address ranges to be assigned to devices, it should give this
> >>>>>>>>> info to qemu and qemu can give this to guest.
> >>>>>>>>> Then anything outside that range can be ignored by VFIO.
> >>>>>>>> 
> >>>>>>>> Then we get into deficiencies in the IOMMU API and maybe VFIO.  There's
> >>>>>>>> currently no way to find out the address width of the IOMMU.  We've been
> >>>>>>>> getting by because it's safely close enough to the CPU address width to
> >>>>>>>> not be a concern until we start exposing things at the top of the 64bit
> >>>>>>>> address space.  Maybe I can safely ignore anything above
> >>>>>>>> TARGET_PHYS_ADDR_SPACE_BITS for now.  Thanks,
> >>>>>>>> 
> >>>>>>>> Alex
> >>>>>>> 
> >>>>>>> I think it's not related to target CPU at all - it's a host limitation.
> >>>>>>> So just make up your own constant, maybe depending on host architecture.
> >>>>>>> Long term add an ioctl to query it.
> >>>>>> 
> >>>>>> It's a hardware limitation which I'd imagine has some loose ties to the
> >>>>>> physical address bits of the CPU.
> >>>>>> 
> >>>>>>> Also, we can add a fwcfg interface to tell bios that it should avoid
> >>>>>>> placing BARs above some address.
> >>>>>> 
> >>>>>> That doesn't help this case, it's a spurious mapping caused by sizing
> >>>>>> the BARs with them enabled.  We may still want such a thing to feed into
> >>>>>> building ACPI tables though.
> >>>>> 
> >>>>> Well the point is that if you want BIOS to avoid
> >>>>> specific addresses, you need to tell it what to avoid.
> >>>>> But neither BIOS nor ACPI actually cover the range above
> >>>>> 2^48 ATM so it's not a high priority.
> >>>>> 
> >>>>>>> Since it's a vfio limitation I think it should be a vfio API, along the
> >>>>>>> lines of vfio_get_addr_space_bits(void).
> >>>>>>> (Is this true btw? legacy assignment doesn't have this problem?)
> >>>>>> 
> >>>>>> It's an IOMMU hardware limitation, legacy assignment has the same
> >>>>>> problem.  It looks like legacy will abort() in QEMU for the failed
> >>>>>> mapping and I'm planning to tighten vfio to also kill the VM for failed
> >>>>>> mappings.  In the short term, I think I'll ignore any mappings above
> >>>>>> TARGET_PHYS_ADDR_SPACE_BITS,
> >>>>> 
> >>>>> That seems very wrong. It will still fail on an x86 host if we are
> >>>>> emulating a CPU with full 64 bit addressing. The limitation is on the
> >>>>> host side; there's no real reason to tie it to the target.
> >>> 
> >>> I doubt vfio would be the only thing broken in that case.
> >>> 
> >>>>>> long term vfio already has an IOMMU info
> >>>>>> ioctl that we could use to return this information, but we'll need to
> >>>>>> figure out how to get it out of the IOMMU driver first.
> >>>>>> Thanks,
> >>>>>> 
> >>>>>> Alex
> >>>>> 
> >>>>> Short term, just assume 48 bits on x86.
> >>> 
> >>> I hate to pick an arbitrary value since we have a very specific mapping
> >>> we're trying to avoid.  Perhaps a better option is to skip anything
> >>> where:
> >>> 
> >>>       MemoryRegionSection.offset_within_address_space >
> >>>       ~MemoryRegionSection.offset_within_address_space
> >>> 
> >>>>> We need to figure out what's the limitation on ppc and arm -
> >>>>> maybe there's none and it can address full 64 bit range.
> >>>> 
> >>>> IIUC on PPC and ARM you always have BAR windows where things can get mapped into. Unlike x86 where the full physical address range can be overlaid by BARs.
> >>>> 
> >>>> Or did I misunderstand the question?
> >>> 
> >>> Sounds right, if either BAR mappings outside the window will not be
> >>> realized in the memory space or the IOMMU has a full 64bit address
> >>> space, there's no problem.  Here we have an intermediate step in the BAR
> >>> sizing producing a stray mapping that the IOMMU hardware can't handle.
> >>> Even if we could handle it, it's not clear that we want to.  On AMD-Vi
> >>> the IOMMU page tables can grow to 6 levels deep.  A stray mapping like
> >>> this then causes space and time overhead until the tables are pruned
> >>> back down.  Thanks,
> >> 
> >> I thought sizing is hard-defined as setting the BAR to -1?
> >> Can't we check for that one special case and treat it as "not mapped, but tell the guest the size in config space"?
> >> 
> >> Alex
> > 
> > We already have a work-around like this and it works for 32 bit BARs
> > or after software writes the full 64 bit register:
> >    /* reject BARs that wrap around, sit at zero, or were never mapped */
> >    if (last_addr <= new_addr || new_addr == 0 ||
> >        last_addr == PCI_BAR_UNMAPPED) {
> >        return PCI_BAR_UNMAPPED;
> >    }
> > 
> >    /* a 32 bit BAR cannot extend beyond the 4G boundary */
> >    if  (!(type & PCI_BASE_ADDRESS_MEM_TYPE_64) && last_addr >= UINT32_MAX) {
> >        return PCI_BAR_UNMAPPED;
> >    }
> > 
> > 
> > But for 64 bit BARs this software writes all 1's
> > in the high 32 bit register before writing in the low register
> > (see trace above).
> > This makes it impossible to distinguish between
> > setting bar at fffffffffebe0000 and this intermediate sizing step.
> 
> Well, at least according to the AMD manual there's only support for 52 bits of physical address space:
> 
> 	• Long Mode—This mode is unique to the AMD64 architecture. This mode supports up to 4 petabytes of physical-address space using 52-bit physical addresses.
> 
> Intel seems to agree:
> 
> 	• CPUID.80000008H:EAX[7:0] reports the physical-address width supported by the processor. (For processors that do not support CPUID function 80000008H, the width is generally 36 if CPUID.01H:EDX.PAE [bit 6] = 1 and 32 otherwise.) This width is referred to as MAXPHYADDR. MAXPHYADDR is at most 52.
> 
> Of course there's potential for future extensions to allow for more bits in the future, but at least the current generation x86_64 (and x86) specification clearly only supports 52 bits of physical address space. And non-x86(_64) don't care about bigger address spaces either because they use BAR windows which are very unlikely to grow bigger than 52 bits ;).
> 
> 
> Alex
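For what it's worth, the CPUID leaf quoted above is easy to read; a small sketch using GCC's <cpuid.h>, with fallback values following the rule of thumb quoted from the Intel manual. Note this only reports the CPU's limit, not the IOMMU's, which is the distinction made in the follow-ups.

    #include <cpuid.h>
    #include <stdio.h>

    int main(void)
    {
        unsigned int eax, ebx, ecx, edx;
        unsigned int phys_bits = 36;    /* rough default when the leaf is absent */

        /* CPUID.80000008H:EAX[7:0] is MAXPHYADDR, at most 52 per the manuals. */
        if (__get_cpuid(0x80000008, &eax, &ebx, &ecx, &edx)) {
            phys_bits = eax & 0xff;
        }

        printf("MAXPHYADDR: %u bits\n", phys_bits);
        return 0;
    }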

Yes but that's from CPU's point of view.
I think that devices can still access each other's BARs
using full 64 bit addresses.
Michael S. Tsirkin Jan. 14, 2014, 10:43 a.m. UTC | #19
On Tue, Jan 14, 2014 at 10:20:57AM +0100, Alexander Graf wrote:
> 
> Well, at least according to the AMD manual there's only support for 52 bits of physical address space:
> 
> 	• Long Mode—This mode is unique to the AMD64 architecture. This mode supports up to 4 petabytes of physical-address space using 52-bit physical addresses.
> 
> Intel seems to agree:
> 
> 	• CPUID.80000008H:EAX[7:0] reports the physical-address width supported by the processor. (For processors that do not support CPUID function 80000008H, the width is generally 36 if CPUID.01H:EDX.PAE [bit 6] = 1 and 32 otherwise.) This width is referred to as MAXPHYADDR. MAXPHYADDR is at most 52.
> 
> Of course there's potential for future extensions to allow for more bits in the future, but at least the current generation x86_64 (and x86) specification clearly only supports 52 bits of physical address space. And non-x86(_64) don't care about bigger address spaces either because they use BAR windows which are very unlikely to grow bigger than 52 bits ;).
> 
> 
> Alex


I guess we could limit PCI memory to 52 bits easily enough.
But Alex here says the IOMMU is limited to 48 bits, so that
still won't be good enough in all cases, even if it helps
in this specific case.

We really need to figure out the host limitations
by querying vfio.
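A sketch of what such a query could look like from QEMU's side, along the lines of the vfio_get_addr_space_bits() idea earlier in the thread (the container fd argument is an addition here). VFIO_IOMMU_GET_INFO is the existing ioctl; the width it would need to report is not there yet, so the return values below are placeholder assumptions:

    #include <linux/vfio.h>
    #include <sys/ioctl.h>

    /* Sketch only: vfio_iommu_type1_info currently reports just iova_pgsizes,
     * so until an address-width field is added this can only return the
     * short-term guess discussed above. */
    static unsigned int vfio_get_addr_space_bits(int container_fd)
    {
        struct vfio_iommu_type1_info info = { .argsz = sizeof(info) };

        if (ioctl(container_fd, VFIO_IOMMU_GET_INFO, &info) < 0) {
            return 36;   /* arbitrary conservative fallback */
        }

        return 48;       /* short-term assumption for x86 hosts */
    }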
Michael S. Tsirkin Jan. 14, 2014, 11:50 a.m. UTC | #20
On Tue, Jan 14, 2014 at 12:24:24PM +0200, Avi Kivity wrote:
> On 01/14/2014 12:48 AM, Alex Williamson wrote:
> >On Mon, 2014-01-13 at 22:48 +0100, Alexander Graf wrote:
> >>I thought sizing is hard-defined as setting the BAR to -1?
> >>Can't we check for that one special case and treat it as "not mapped, but tell the guest the size in config space"?
> >PCI doesn't want to handle this as anything special to differentiate a
> >sizing mask from a valid BAR address.  I agree though, I'd prefer to
> >never see a spurious address like this in my MemoryListener.
> >
> >
> 
> Can't you just ignore regions that cannot be mapped?  Oh, and teach
> the bios and/or linux to disable memory access while sizing.


I know Linux won't disable memory access while sizing because
there are some broken devices where you can't re-enable it afterwards.

It should be harmless to set BAR to any silly value as long
as you are careful not to access it.
Michael S. Tsirkin Jan. 14, 2014, 12:07 p.m. UTC | #21
On Mon, Jan 13, 2014 at 03:48:11PM -0700, Alex Williamson wrote:
> On Mon, 2014-01-13 at 22:48 +0100, Alexander Graf wrote:
> > 
> > > Am 13.01.2014 um 22:39 schrieb Alex Williamson <alex.williamson@redhat.com>:
> > > 
> > >> On Sun, 2014-01-12 at 16:03 +0100, Alexander Graf wrote:
> > >>> On 12.01.2014, at 08:54, Michael S. Tsirkin <mst@redhat.com> wrote:
> > >>> 
> > >>>> On Fri, Jan 10, 2014 at 08:31:36AM -0700, Alex Williamson wrote:
> > >>>>> On Fri, 2014-01-10 at 14:55 +0200, Michael S. Tsirkin wrote:
> > >>>>>> On Thu, Jan 09, 2014 at 03:42:22PM -0700, Alex Williamson wrote:
> > >>>>>>> On Thu, 2014-01-09 at 23:56 +0200, Michael S. Tsirkin wrote:
> > >>>>>>>> On Thu, Jan 09, 2014 at 12:03:26PM -0700, Alex Williamson wrote:
> > >>>>>>>>> On Thu, 2014-01-09 at 11:47 -0700, Alex Williamson wrote:
> > >>>>>>>>>> On Thu, 2014-01-09 at 20:00 +0200, Michael S. Tsirkin wrote:
> > >>>>>>>>>>> On Thu, Jan 09, 2014 at 10:24:47AM -0700, Alex Williamson wrote:
> > >>>>>>>>>>>> On Wed, 2013-12-11 at 20:30 +0200, Michael S. Tsirkin wrote:
> > >>>>>>>>>>>> From: Paolo Bonzini <pbonzini@redhat.com>
> > >>>>>>>>>>>> 
> > >>>>>>>>>>>> As an alternative to commit 818f86b (exec: limit system memory
> > >>>>>>>>>>>> size, 2013-11-04) let's just make all address spaces 64-bit wide.
> > >>>>>>>>>>>> This eliminates problems with phys_page_find ignoring bits above
> > >>>>>>>>>>>> TARGET_PHYS_ADDR_SPACE_BITS and address_space_translate_internal
> > >>>>>>>>>>>> consequently messing up the computations.
> > >>>>>>>>>>>> 
> > >>>>>>>>>>>> In Luiz's reported crash, at startup gdb attempts to read from address
> > >>>>>>>>>>>> 0xffffffffffffffe6 to 0xffffffffffffffff inclusive.  The region it gets
> > >>>>>>>>>>>> is the newly introduced master abort region, which is as big as the PCI
> > >>>>>>>>>>>> address space (see pci_bus_init).  Due to a typo that's only 2^63-1,
> > >>>>>>>>>>>> not 2^64.  But we get it anyway because phys_page_find ignores the upper
> > >>>>>>>>>>>> bits of the physical address.  In address_space_translate_internal then
> > >>>>>>>>>>>> 
> > >>>>>>>>>>>>   diff = int128_sub(section->mr->size, int128_make64(addr));
> > >>>>>>>>>>>>   *plen = int128_get64(int128_min(diff, int128_make64(*plen)));
> > >>>>>>>>>>>> 
> > >>>>>>>>>>>> diff becomes negative, and int128_get64 booms.
> > >>>>>>>>>>>> 
> > >>>>>>>>>>>> The size of the PCI address space region should be fixed anyway.
> > >>>>>>>>>>>> 
> > >>>>>>>>>>>> Reported-by: Luiz Capitulino <lcapitulino@redhat.com>
> > >>>>>>>>>>>> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> > >>>>>>>>>>>> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> > >>>>>>>>>>>> ---
> > >>>>>>>>>>>> exec.c | 8 ++------
> > >>>>>>>>>>>> 1 file changed, 2 insertions(+), 6 deletions(-)
> > >>>>>>>>>>>> 
> > >>>>>>>>>>>> diff --git a/exec.c b/exec.c
> > >>>>>>>>>>>> index 7e5ce93..f907f5f 100644
> > >>>>>>>>>>>> --- a/exec.c
> > >>>>>>>>>>>> +++ b/exec.c
> > >>>>>>>>>>>> @@ -94,7 +94,7 @@ struct PhysPageEntry {
> > >>>>>>>>>>>> #define PHYS_MAP_NODE_NIL (((uint32_t)~0) >> 6)
> > >>>>>>>>>>>> 
> > >>>>>>>>>>>> /* Size of the L2 (and L3, etc) page tables.  */
> > >>>>>>>>>>>> -#define ADDR_SPACE_BITS TARGET_PHYS_ADDR_SPACE_BITS
> > >>>>>>>>>>>> +#define ADDR_SPACE_BITS 64
> > >>>>>>>>>>>> 
> > >>>>>>>>>>>> #define P_L2_BITS 10
> > >>>>>>>>>>>> #define P_L2_SIZE (1 << P_L2_BITS)
> > >>>>>>>>>>>> @@ -1861,11 +1861,7 @@ static void memory_map_init(void)
> > >>>>>>>>>>>> {
> > >>>>>>>>>>>>    system_memory = g_malloc(sizeof(*system_memory));
> > >>>>>>>>>>>> 
> > >>>>>>>>>>>> -    assert(ADDR_SPACE_BITS <= 64);
> > >>>>>>>>>>>> -
> > >>>>>>>>>>>> -    memory_region_init(system_memory, NULL, "system",
> > >>>>>>>>>>>> -                       ADDR_SPACE_BITS == 64 ?
> > >>>>>>>>>>>> -                       UINT64_MAX : (0x1ULL << ADDR_SPACE_BITS));
> > >>>>>>>>>>>> +    memory_region_init(system_memory, NULL, "system", UINT64_MAX);
> > >>>>>>>>>>>>    address_space_init(&address_space_memory, system_memory, "memory");
> > >>>>>>>>>>>> 
> > >>>>>>>>>>>>    system_io = g_malloc(sizeof(*system_io));
> > >>>>>>>>>>> 
> > >>>>>>>>>>> This seems to have some unexpected consequences around sizing 64bit PCI
> > >>>>>>>>>>> BARs that I'm not sure how to handle.
> > >>>>>>>>>> 
> > >>>>>>>>>> BARs are often disabled during sizing. Maybe you
> > >>>>>>>>>> don't detect BAR being disabled?
> > >>>>>>>>> 
> > >>>>>>>>> See the trace below, the BARs are not disabled.  QEMU pci-core is doing
> > >>>>>>>>> the sizing and memory region updates for the BARs, vfio is just a
> > >>>>>>>>> pass-through here.
> > >>>>>>>> 
> > >>>>>>>> Sorry, not in the trace below, but yes the sizing seems to be happening
> > >>>>>>>> while I/O & memory are enabled in the command register.  Thanks,
> > >>>>>>>> 
> > >>>>>>>> Alex
> > >>>>>>> 
> > >>>>>>> OK then from QEMU POV this BAR value is not special at all.
> > >>>>>> 
> > >>>>>> Unfortunately
> > >>>>>> 
> > >>>>>>>>>>> After this patch I get vfio
> > >>>>>>>>>>> traces like this:
> > >>>>>>>>>>> 
> > >>>>>>>>>>> vfio: vfio_pci_read_config(0000:01:10.0, @0x10, len=0x4) febe0004
> > >>>>>>>>>>> (save lower 32bits of BAR)
> > >>>>>>>>>>> vfio: vfio_pci_write_config(0000:01:10.0, @0x10, 0xffffffff, len=0x4)
> > >>>>>>>>>>> (write mask to BAR)
> > >>>>>>>>>>> vfio: region_del febe0000 - febe3fff
> > >>>>>>>>>>> (memory region gets unmapped)
> > >>>>>>>>>>> vfio: vfio_pci_read_config(0000:01:10.0, @0x10, len=0x4) ffffc004
> > >>>>>>>>>>> (read size mask)
> > >>>>>>>>>>> vfio: vfio_pci_write_config(0000:01:10.0, @0x10, 0xfebe0004, len=0x4)
> > >>>>>>>>>>> (restore BAR)
> > >>>>>>>>>>> vfio: region_add febe0000 - febe3fff [0x7fcf3654d000]
> > >>>>>>>>>>> (memory region re-mapped)
> > >>>>>>>>>>> vfio: vfio_pci_read_config(0000:01:10.0, @0x14, len=0x4) 0
> > >>>>>>>>>>> (save upper 32bits of BAR)
> > >>>>>>>>>>> vfio: vfio_pci_write_config(0000:01:10.0, @0x14, 0xffffffff, len=0x4)
> > >>>>>>>>>>> (write mask to BAR)
> > >>>>>>>>>>> vfio: region_del febe0000 - febe3fff
> > >>>>>>>>>>> (memory region gets unmapped)
> > >>>>>>>>>>> vfio: region_add fffffffffebe0000 - fffffffffebe3fff [0x7fcf3654d000]
> > >>>>>>>>>>> (memory region gets re-mapped with new address)
> > >>>>>>>>>>> qemu-system-x86_64: vfio_dma_map(0x7fcf38861710, 0xfffffffffebe0000, 0x4000, 0x7fcf3654d000) = -14 (Bad address)
> > >>>>>>>>>>> (iommu barfs because it can only handle 48bit physical addresses)
> > >>>>>>>>>> 
> > >>>>>>>>>> Why are you trying to program BAR addresses for dma in the iommu?
> > >>>>>>>>> 
> > >>>>>>>>> Two reasons, first I can't tell the difference between RAM and MMIO.
> > >>>>>>> 
> > >>>>>>> Why can't you? Generally memory core let you find out easily.
> > >>>>>> 
> > >>>>>> My MemoryListener is setup for &address_space_memory and I then filter
> > >>>>>> out anything that's not memory_region_is_ram().  This still gets
> > >>>>>> through, so how do I easily find out?
> > >>>>>> 
> > >>>>>>> But in this case it's vfio device itself that is sized so for sure you
> > >>>>>>> know it's MMIO.
> > >>>>>> 
> > >>>>>> How so?  I have a MemoryListener as described above and pass everything
> > >>>>>> through to the IOMMU.  I suppose I could look through all the
> > >>>>>> VFIODevices and check if the MemoryRegion matches, but that seems really
> > >>>>>> ugly.
> > >>>>>> 
> > >>>>>>> Maybe you will have same issue if there's another device with a 64 bit
> > >>>>>>> bar though, like ivshmem?
> > >>>>>> 
> > >>>>>> Perhaps, I suspect I'll see anything that registers their BAR
> > >>>>>> MemoryRegion from memory_region_init_ram or memory_region_init_ram_ptr.
> > >>>>> 
> > >>>>> Must be a 64 bit BAR to trigger the issue though.
> > >>>>> 
> > >>>>>>>>> Second, it enables peer-to-peer DMA between devices, which is something
> > >>>>>>>>> that we might be able to take advantage of with GPU passthrough.
> > >>>>>>>>> 
> > >>>>>>>>>>> Prior to this change, there was no re-map with the fffffffffebe0000
> > >>>>>>>>>>> address, presumably because it was beyond the address space of the PCI
> > >>>>>>>>>>> window.  This address is clearly not in a PCI MMIO space, so why are we
> > >>>>>>>>>>> allowing it to be realized in the system address space at this location?
> > >>>>>>>>>>> Thanks,
> > >>>>>>>>>>> 
> > >>>>>>>>>>> Alex
> > >>>>>>>>>> 
> > >>>>>>>>>> Why do you think it is not in PCI MMIO space?
> > >>>>>>>>>> True, CPU can't access this address but other pci devices can.
> > >>>>>>>>> 
> > >>>>>>>>> What happens on real hardware when an address like this is programmed to
> > >>>>>>>>> a device?  The CPU doesn't have the physical bits to access it.  I have
> > >>>>>>>>> serious doubts that another PCI device would be able to access it
> > >>>>>>>>> either.  Maybe in some limited scenario where the devices are on the
> > >>>>>>>>> same conventional PCI bus.  In the typical case, PCI addresses are
> > >>>>>>>>> always limited by some kind of aperture, whether that's explicit in
> > >>>>>>>>> bridge windows or implicit in hardware design (and perhaps made explicit
> > >>>>>>>>> in ACPI).  Even if I wanted to filter these out as noise in vfio, how
> > >>>>>>>>> would I do it in a way that still allows real 64bit MMIO to be
> > >>>>>>>>> programmed.  PCI has this knowledge, I hope.  VFIO doesn't.  Thanks,
> > >>>>>>>>> 
> > >>>>>>>>> Alex
> > >>>>>>> 
> > >>>>>>> AFAIK PCI doesn't have that knowledge as such. PCI spec is explicit that
> > >>>>>>> full 64 bit addresses must be allowed and hardware validation
> > >>>>>>> test suites normally check that it actually does work
> > >>>>>>> if it happens.
> > >>>>>> 
> > >>>>>> Sure, PCI devices themselves, but the chipset typically has defined
> > >>>>>> routing, that's more what I'm referring to.  There are generally only
> > >>>>>> fixed address windows for RAM vs MMIO.
> > >>>>> 
> > >>>>> The physical chipset? Likely - in the presence of IOMMU.
> > >>>>> Without that, devices can talk to each other without going
> > >>>>> through chipset, and bridge spec is very explicit that
> > >>>>> full 64 bit addressing must be supported.
> > >>>>> 
> > >>>>> So as long as we don't emulate an IOMMU,
> > >>>>> guest will normally think it's okay to use any address.
> > >>>>> 
> > >>>>>>> Yes, if there's a bridge somewhere on the path that bridge's
> > >>>>>>> windows would protect you, but pci already does this filtering:
> > >>>>>>> if you see this address in the memory map this means
> > >>>>>>> your virtual device is on root bus.
> > >>>>>>> 
> > >>>>>>> So I think it's the other way around: if VFIO requires specific
> > >>>>>>> address ranges to be assigned to devices, it should give this
> > >>>>>>> info to qemu and qemu can give this to guest.
> > >>>>>>> Then anything outside that range can be ignored by VFIO.
> > >>>>>> 
> > >>>>>> Then we get into deficiencies in the IOMMU API and maybe VFIO.  There's
> > >>>>>> currently no way to find out the address width of the IOMMU.  We've been
> > >>>>>> getting by because it's safely close enough to the CPU address width to
> > >>>>>> not be a concern until we start exposing things at the top of the 64bit
> > >>>>>> address space.  Maybe I can safely ignore anything above
> > >>>>>> TARGET_PHYS_ADDR_SPACE_BITS for now.  Thanks,
> > >>>>>> 
> > >>>>>> Alex
> > >>>>> 
> > >>>>> I think it's not related to target CPU at all - it's a host limitation.
> > >>>>> So just make up your own constant, maybe depending on host architecture.
> > >>>>> Long term add an ioctl to query it.
> > >>>> 
> > >>>> It's a hardware limitation which I'd imagine has some loose ties to the
> > >>>> physical address bits of the CPU.
> > >>>> 
> > >>>>> Also, we can add a fwcfg interface to tell bios that it should avoid
> > >>>>> placing BARs above some address.
> > >>>> 
> > >>>> That doesn't help this case, it's a spurious mapping caused by sizing
> > >>>> the BARs with them enabled.  We may still want such a thing to feed into
> > >>>> building ACPI tables though.
> > >>> 
> > >>> Well the point is that if you want BIOS to avoid
> > >>> specific addresses, you need to tell it what to avoid.
> > >>> But neither BIOS nor ACPI actually cover the range above
> > >>> 2^48 ATM so it's not a high priority.
> > >>> 
> > >>>>> Since it's a vfio limitation I think it should be a vfio API, along the
> > >>>>> lines of vfio_get_addr_space_bits(void).
> > >>>>> (Is this true btw? legacy assignment doesn't have this problem?)
> > >>>> 
> > >>>> It's an IOMMU hardware limitation, legacy assignment has the same
> > >>>> problem.  It looks like legacy will abort() in QEMU for the failed
> > >>>> mapping and I'm planning to tighten vfio to also kill the VM for failed
> > >>>> mappings.  In the short term, I think I'll ignore any mappings above
> > >>>> TARGET_PHYS_ADDR_SPACE_BITS,
> > >>> 
> > >>> That seems very wrong. It will still fail on an x86 host if we are
> > >>> emulating a CPU with full 64 bit addressing. The limitation is on the
> > >>> host side there's no real reason to tie it to the target.
> > > 
> > > I doubt vfio would be the only thing broken in that case.
> > > 
> > >>>> long term vfio already has an IOMMU info
> > >>>> ioctl that we could use to return this information, but we'll need to
> > >>>> figure out how to get it out of the IOMMU driver first.
> > >>>> Thanks,
> > >>>> 
> > >>>> Alex
> > >>> 
> > >>> Short term, just assume 48 bits on x86.
> > > 
> > > I hate to pick an arbitrary value since we have a very specific mapping
> > > we're trying to avoid.  Perhaps a better option is to skip anything
> > > where:
> > > 
> > >        MemoryRegionSection.offset_within_address_space >
> > >        ~MemoryRegionSection.offset_within_address_space
> > > 
> > >>> We need to figure out what's the limitation on ppc and arm -
> > >>> maybe there's none and it can address full 64 bit range.
> > >> 
> > >> IIUC on PPC and ARM you always have BAR windows where things can get mapped into. Unlike x86 where the full phyiscal address range can be overlayed by BARs.
> > >> 
> > >> Or did I misunderstand the question?
> > > 
> > > Sounds right, if either BAR mappings outside the window will not be
> > > realized in the memory space or the IOMMU has a full 64bit address
> > > space, there's no problem.  Here we have an intermediate step in the BAR
> > > sizing producing a stray mapping that the IOMMU hardware can't handle.
> > > Even if we could handle it, it's not clear that we want to.  On AMD-Vi
> > > the IOMMU pages tables can grow to 6-levels deep.  A stray mapping like
> > > this then causes space and time overhead until the tables are pruned
> > > back down.  Thanks,
> > 
> > I thought sizing is hard defined as a set to
> > -1? Can't we check for that one special case and treat it as "not mapped, but tell the guest the size in config space"?
> 
> PCI doesn't want to handle this as anything special to differentiate a
> sizing mask from a valid BAR address.  I agree though, I'd prefer to
> never see a spurious address like this in my MemoryListener.

It's more a "can't" than a "doesn't want to": it's a 64-bit BAR, so it's
not set to all ones atomically.

Also, while it doesn't address this fully (the same issue can happen
e.g. with ivshmem), do you think we should somehow distinguish BARs
mapped by vfio / device assignment in qemu?

In particular, even when the BAR has sane addresses:
a device really cannot DMA into its own BAR; that's a spec violation,
so in theory anything can happen, including crashing the system.
I don't know what happens in practice, but
if you are programming the IOMMU to forward transactions back to
the device that originated them, you are not doing it any favors.

I also note that if someone tries zero-copy transmit out of such an
address, get_user_pages will fail.
I think this means tun zero-copy transmit needs to fall back
to copy-from-user on get_user_pages failure.
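
To make that concrete, here is a minimal sketch of the fallback pattern
(my illustration, not the actual tun driver code; pin_user_buffer(),
send_zero_copy() and copy_and_send() are hypothetical stand-ins for the
real kernel calls such as get_user_pages()):

    /* Sketch only: try the zero-copy path, which has to pin the user pages;
     * if pinning fails (as it would for an address backed by a device BAR),
     * fall back to an ordinary copy-based transmit. */
    #include <stdbool.h>
    #include <stddef.h>

    bool pin_user_buffer(const void *uaddr, size_t len);  /* hypothetical, cf. get_user_pages() */
    int  send_zero_copy(const void *uaddr, size_t len);   /* hypothetical zero-copy transmit */
    int  copy_and_send(const void *uaddr, size_t len);    /* hypothetical copy-based transmit */

    int transmit(const void *uaddr, size_t len)
    {
        if (pin_user_buffer(uaddr, len))
            return send_zero_copy(uaddr, len);  /* fast path: pages pinned successfully */

        /* Pinning failed, e.g. the address points into an MMIO BAR: copy instead. */
        return copy_and_send(uaddr, len);
    }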

Jason, what's your thinking on this?
Michael S. Tsirkin Jan. 14, 2014, 12:21 p.m. UTC | #22
On Mon, Jan 13, 2014 at 02:39:04PM -0700, Alex Williamson wrote:
> On Sun, 2014-01-12 at 16:03 +0100, Alexander Graf wrote:
> > On 12.01.2014, at 08:54, Michael S. Tsirkin <mst@redhat.com> wrote:
> > 
> > > On Fri, Jan 10, 2014 at 08:31:36AM -0700, Alex Williamson wrote:
> > >> On Fri, 2014-01-10 at 14:55 +0200, Michael S. Tsirkin wrote:
> > >>> On Thu, Jan 09, 2014 at 03:42:22PM -0700, Alex Williamson wrote:
> > >>>> On Thu, 2014-01-09 at 23:56 +0200, Michael S. Tsirkin wrote:
> > >>>>> On Thu, Jan 09, 2014 at 12:03:26PM -0700, Alex Williamson wrote:
> > >>>>>> On Thu, 2014-01-09 at 11:47 -0700, Alex Williamson wrote:
> > >>>>>>> On Thu, 2014-01-09 at 20:00 +0200, Michael S. Tsirkin wrote:
> > >>>>>>>> On Thu, Jan 09, 2014 at 10:24:47AM -0700, Alex Williamson wrote:
> > >>>>>>>>> On Wed, 2013-12-11 at 20:30 +0200, Michael S. Tsirkin wrote:
> > >>>>>>>>>> From: Paolo Bonzini <pbonzini@redhat.com>
> > >>>>>>>>>> 
> > >>>>>>>>>> As an alternative to commit 818f86b (exec: limit system memory
> > >>>>>>>>>> size, 2013-11-04) let's just make all address spaces 64-bit wide.
> > >>>>>>>>>> This eliminates problems with phys_page_find ignoring bits above
> > >>>>>>>>>> TARGET_PHYS_ADDR_SPACE_BITS and address_space_translate_internal
> > >>>>>>>>>> consequently messing up the computations.
> > >>>>>>>>>> 
> > >>>>>>>>>> In Luiz's reported crash, at startup gdb attempts to read from address
> > >>>>>>>>>> 0xffffffffffffffe6 to 0xffffffffffffffff inclusive.  The region it gets
> > >>>>>>>>>> is the newly introduced master abort region, which is as big as the PCI
> > >>>>>>>>>> address space (see pci_bus_init).  Due to a typo that's only 2^63-1,
> > >>>>>>>>>> not 2^64.  But we get it anyway because phys_page_find ignores the upper
> > >>>>>>>>>> bits of the physical address.  In address_space_translate_internal then
> > >>>>>>>>>> 
> > >>>>>>>>>>    diff = int128_sub(section->mr->size, int128_make64(addr));
> > >>>>>>>>>>    *plen = int128_get64(int128_min(diff, int128_make64(*plen)));
> > >>>>>>>>>> 
> > >>>>>>>>>> diff becomes negative, and int128_get64 booms.
> > >>>>>>>>>> 
> > >>>>>>>>>> The size of the PCI address space region should be fixed anyway.
> > >>>>>>>>>> 
> > >>>>>>>>>> Reported-by: Luiz Capitulino <lcapitulino@redhat.com>
> > >>>>>>>>>> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> > >>>>>>>>>> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> > >>>>>>>>>> ---
> > >>>>>>>>>> exec.c | 8 ++------
> > >>>>>>>>>> 1 file changed, 2 insertions(+), 6 deletions(-)
> > >>>>>>>>>> 
> > >>>>>>>>>> diff --git a/exec.c b/exec.c
> > >>>>>>>>>> index 7e5ce93..f907f5f 100644
> > >>>>>>>>>> --- a/exec.c
> > >>>>>>>>>> +++ b/exec.c
> > >>>>>>>>>> @@ -94,7 +94,7 @@ struct PhysPageEntry {
> > >>>>>>>>>> #define PHYS_MAP_NODE_NIL (((uint32_t)~0) >> 6)
> > >>>>>>>>>> 
> > >>>>>>>>>> /* Size of the L2 (and L3, etc) page tables.  */
> > >>>>>>>>>> -#define ADDR_SPACE_BITS TARGET_PHYS_ADDR_SPACE_BITS
> > >>>>>>>>>> +#define ADDR_SPACE_BITS 64
> > >>>>>>>>>> 
> > >>>>>>>>>> #define P_L2_BITS 10
> > >>>>>>>>>> #define P_L2_SIZE (1 << P_L2_BITS)
> > >>>>>>>>>> @@ -1861,11 +1861,7 @@ static void memory_map_init(void)
> > >>>>>>>>>> {
> > >>>>>>>>>>     system_memory = g_malloc(sizeof(*system_memory));
> > >>>>>>>>>> 
> > >>>>>>>>>> -    assert(ADDR_SPACE_BITS <= 64);
> > >>>>>>>>>> -
> > >>>>>>>>>> -    memory_region_init(system_memory, NULL, "system",
> > >>>>>>>>>> -                       ADDR_SPACE_BITS == 64 ?
> > >>>>>>>>>> -                       UINT64_MAX : (0x1ULL << ADDR_SPACE_BITS));
> > >>>>>>>>>> +    memory_region_init(system_memory, NULL, "system", UINT64_MAX);
> > >>>>>>>>>>     address_space_init(&address_space_memory, system_memory, "memory");
> > >>>>>>>>>> 
> > >>>>>>>>>>     system_io = g_malloc(sizeof(*system_io));
> > >>>>>>>>> 
> > >>>>>>>>> This seems to have some unexpected consequences around sizing 64bit PCI
> > >>>>>>>>> BARs that I'm not sure how to handle.
> > >>>>>>>> 
> > >>>>>>>> BARs are often disabled during sizing. Maybe you
> > >>>>>>>> don't detect BAR being disabled?
> > >>>>>>> 
> > >>>>>>> See the trace below, the BARs are not disabled.  QEMU pci-core is doing
> > >>>>>>> the sizing an memory region updates for the BARs, vfio is just a
> > >>>>>>> pass-through here.
> > >>>>>> 
> > >>>>>> Sorry, not in the trace below, but yes the sizing seems to be happening
> > >>>>>> while I/O & memory are enabled int he command register.  Thanks,
> > >>>>>> 
> > >>>>>> Alex
> > >>>>> 
> > >>>>> OK then from QEMU POV this BAR value is not special at all.
> > >>>> 
> > >>>> Unfortunately
> > >>>> 
> > >>>>>>>>> After this patch I get vfio
> > >>>>>>>>> traces like this:
> > >>>>>>>>> 
> > >>>>>>>>> vfio: vfio_pci_read_config(0000:01:10.0, @0x10, len=0x4) febe0004
> > >>>>>>>>> (save lower 32bits of BAR)
> > >>>>>>>>> vfio: vfio_pci_write_config(0000:01:10.0, @0x10, 0xffffffff, len=0x4)
> > >>>>>>>>> (write mask to BAR)
> > >>>>>>>>> vfio: region_del febe0000 - febe3fff
> > >>>>>>>>> (memory region gets unmapped)
> > >>>>>>>>> vfio: vfio_pci_read_config(0000:01:10.0, @0x10, len=0x4) ffffc004
> > >>>>>>>>> (read size mask)
> > >>>>>>>>> vfio: vfio_pci_write_config(0000:01:10.0, @0x10, 0xfebe0004, len=0x4)
> > >>>>>>>>> (restore BAR)
> > >>>>>>>>> vfio: region_add febe0000 - febe3fff [0x7fcf3654d000]
> > >>>>>>>>> (memory region re-mapped)
> > >>>>>>>>> vfio: vfio_pci_read_config(0000:01:10.0, @0x14, len=0x4) 0
> > >>>>>>>>> (save upper 32bits of BAR)
> > >>>>>>>>> vfio: vfio_pci_write_config(0000:01:10.0, @0x14, 0xffffffff, len=0x4)
> > >>>>>>>>> (write mask to BAR)
> > >>>>>>>>> vfio: region_del febe0000 - febe3fff
> > >>>>>>>>> (memory region gets unmapped)
> > >>>>>>>>> vfio: region_add fffffffffebe0000 - fffffffffebe3fff [0x7fcf3654d000]
> > >>>>>>>>> (memory region gets re-mapped with new address)
> > >>>>>>>>> qemu-system-x86_64: vfio_dma_map(0x7fcf38861710, 0xfffffffffebe0000, 0x4000, 0x7fcf3654d000) = -14 (Bad address)
> > >>>>>>>>> (iommu barfs because it can only handle 48bit physical addresses)
> > >>>>>>>>> 
> > >>>>>>>> 
> > >>>>>>>> Why are you trying to program BAR addresses for dma in the iommu?
> > >>>>>>> 
> > >>>>>>> Two reasons, first I can't tell the difference between RAM and MMIO.
> > >>>>> 
> > >>>>> Why can't you? Generally memory core let you find out easily.
> > >>>> 
> > >>>> My MemoryListener is setup for &address_space_memory and I then filter
> > >>>> out anything that's not memory_region_is_ram().  This still gets
> > >>>> through, so how do I easily find out?
> > >>>> 
> > >>>>> But in this case it's vfio device itself that is sized so for sure you
> > >>>>> know it's MMIO.
> > >>>> 
> > >>>> How so?  I have a MemoryListener as described above and pass everything
> > >>>> through to the IOMMU.  I suppose I could look through all the
> > >>>> VFIODevices and check if the MemoryRegion matches, but that seems really
> > >>>> ugly.
> > >>>> 
> > >>>>> Maybe you will have same issue if there's another device with a 64 bit
> > >>>>> bar though, like ivshmem?
> > >>>> 
> > >>>> Perhaps, I suspect I'll see anything that registers their BAR
> > >>>> MemoryRegion from memory_region_init_ram or memory_region_init_ram_ptr.
> > >>> 
> > >>> Must be a 64 bit BAR to trigger the issue though.
> > >>> 
> > >>>>>>> Second, it enables peer-to-peer DMA between devices, which is something
> > >>>>>>> that we might be able to take advantage of with GPU passthrough.
> > >>>>>>> 
> > >>>>>>>>> Prior to this change, there was no re-map with the fffffffffebe0000
> > >>>>>>>>> address, presumably because it was beyond the address space of the PCI
> > >>>>>>>>> window.  This address is clearly not in a PCI MMIO space, so why are we
> > >>>>>>>>> allowing it to be realized in the system address space at this location?
> > >>>>>>>>> Thanks,
> > >>>>>>>>> 
> > >>>>>>>>> Alex
> > >>>>>>>> 
> > >>>>>>>> Why do you think it is not in PCI MMIO space?
> > >>>>>>>> True, CPU can't access this address but other pci devices can.
> > >>>>>>> 
> > >>>>>>> What happens on real hardware when an address like this is programmed to
> > >>>>>>> a device?  The CPU doesn't have the physical bits to access it.  I have
> > >>>>>>> serious doubts that another PCI device would be able to access it
> > >>>>>>> either.  Maybe in some limited scenario where the devices are on the
> > >>>>>>> same conventional PCI bus.  In the typical case, PCI addresses are
> > >>>>>>> always limited by some kind of aperture, whether that's explicit in
> > >>>>>>> bridge windows or implicit in hardware design (and perhaps made explicit
> > >>>>>>> in ACPI).  Even if I wanted to filter these out as noise in vfio, how
> > >>>>>>> would I do it in a way that still allows real 64bit MMIO to be
> > >>>>>>> programmed.  PCI has this knowledge, I hope.  VFIO doesn't.  Thanks,
> > >>>>>>> 
> > >>>>>>> Alex
> > >>>>>> 
> > >>>>> 
> > >>>>> AFAIK PCI doesn't have that knowledge as such. PCI spec is explicit that
> > >>>>> full 64 bit addresses must be allowed and hardware validation
> > >>>>> test suites normally check that it actually does work
> > >>>>> if it happens.
> > >>>> 
> > >>>> Sure, PCI devices themselves, but the chipset typically has defined
> > >>>> routing, that's more what I'm referring to.  There are generally only
> > >>>> fixed address windows for RAM vs MMIO.
> > >>> 
> > >>> The physical chipset? Likely - in the presence of IOMMU.
> > >>> Without that, devices can talk to each other without going
> > >>> through chipset, and bridge spec is very explicit that
> > >>> full 64 bit addressing must be supported.
> > >>> 
> > >>> So as long as we don't emulate an IOMMU,
> > >>> guest will normally think it's okay to use any address.
> > >>> 
> > >>>>> Yes, if there's a bridge somewhere on the path that bridge's
> > >>>>> windows would protect you, but pci already does this filtering:
> > >>>>> if you see this address in the memory map this means
> > >>>>> your virtual device is on root bus.
> > >>>>> 
> > >>>>> So I think it's the other way around: if VFIO requires specific
> > >>>>> address ranges to be assigned to devices, it should give this
> > >>>>> info to qemu and qemu can give this to guest.
> > >>>>> Then anything outside that range can be ignored by VFIO.
> > >>>> 
> > >>>> Then we get into deficiencies in the IOMMU API and maybe VFIO.  There's
> > >>>> currently no way to find out the address width of the IOMMU.  We've been
> > >>>> getting by because it's safely close enough to the CPU address width to
> > >>>> not be a concern until we start exposing things at the top of the 64bit
> > >>>> address space.  Maybe I can safely ignore anything above
> > >>>> TARGET_PHYS_ADDR_SPACE_BITS for now.  Thanks,
> > >>>> 
> > >>>> Alex
> > >>> 
> > >>> I think it's not related to target CPU at all - it's a host limitation.
> > >>> So just make up your own constant, maybe depending on host architecture.
> > >>> Long term add an ioctl to query it.
> > >> 
> > >> It's a hardware limitation which I'd imagine has some loose ties to the
> > >> physical address bits of the CPU.
> > >> 
> > >>> Also, we can add a fwcfg interface to tell bios that it should avoid
> > >>> placing BARs above some address.
> > >> 
> > >> That doesn't help this case, it's a spurious mapping caused by sizing
> > >> the BARs with them enabled.  We may still want such a thing to feed into
> > >> building ACPI tables though.
> > > 
> > > Well the point is that if you want BIOS to avoid
> > > specific addresses, you need to tell it what to avoid.
> > > But neither BIOS nor ACPI actually cover the range above
> > > 2^48 ATM so it's not a high priority.
> > > 
> > >>> Since it's a vfio limitation I think it should be a vfio API, along the
> > >>> lines of vfio_get_addr_space_bits(void).
> > >>> (Is this true btw? legacy assignment doesn't have this problem?)
> > >> 
> > >> It's an IOMMU hardware limitation, legacy assignment has the same
> > >> problem.  It looks like legacy will abort() in QEMU for the failed
> > >> mapping and I'm planning to tighten vfio to also kill the VM for failed
> > >> mappings.  In the short term, I think I'll ignore any mappings above
> > >> TARGET_PHYS_ADDR_SPACE_BITS,
> > > 
> > > That seems very wrong. It will still fail on an x86 host if we are
> > > emulating a CPU with full 64 bit addressing. The limitation is on the
> > > host side there's no real reason to tie it to the target.
> 
> I doubt vfio would be the only thing broken in that case.

A bit cryptic.
target-s390x/cpu.h:#define TARGET_PHYS_ADDR_SPACE_BITS 64
So qemu does emulate at least one full 64-bit CPU.

It's possible that something limits PCI BAR addresses
there; it might or might not be architectural.

> > >> long term vfio already has an IOMMU info
> > >> ioctl that we could use to return this information, but we'll need to
> > >> figure out how to get it out of the IOMMU driver first.
> > >> Thanks,
> > >> 
> > >> Alex
> > > 
> > > Short term, just assume 48 bits on x86.
> 
> I hate to pick an arbitrary value since we have a very specific mapping
> we're trying to avoid.

Well it's not a specific mapping really.

Any mapping outside the host IOMMU's range would not work.
Guests happen to trigger it while sizing, but again,
they are allowed to write anything into BARs really.

>  Perhaps a better option is to skip anything
> where:
> 
>         MemoryRegionSection.offset_within_address_space >
>         ~MemoryRegionSection.offset_within_address_space


This merely checks that the high bit is 1, doesn't it?

So this equivalently assumes 63 bits on x86; if you prefer
63 rather than 48, that's fine with me.
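
For reference, a tiny standalone sketch of that check (my reading of it,
not code from qemu): for an unsigned 64-bit offset, x > ~x holds exactly
when bit 63 is set, so the filter skips everything at or above 2^63:

    /* Minimal sketch: the proposed filter is a top-bit test on the unsigned
     * 64-bit offset_within_address_space value. */
    #include <assert.h>
    #include <stdbool.h>
    #include <stdint.h>

    static bool skip_section(uint64_t offset_within_address_space)
    {
        return offset_within_address_space > ~offset_within_address_space;
    }

    int main(void)
    {
        assert(!skip_section(0xfebe0000ULL));         /* ordinary 32-bit MMIO address */
        assert(skip_section(0xfffffffffebe0000ULL));  /* the stray sizing address from the trace */
        return 0;
    }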




> > > We need to figure out what's the limitation on ppc and arm -
> > > maybe there's none and it can address full 64 bit range.
> > 
> > IIUC on PPC and ARM you always have BAR windows where things can get mapped into. Unlike x86 where the full phyiscal address range can be overlayed by BARs.
> > 
> > Or did I misunderstand the question?
> 
> Sounds right, if either BAR mappings outside the window will not be
> realized in the memory space or the IOMMU has a full 64bit address
> space, there's no problem.  Here we have an intermediate step in the BAR
> sizing producing a stray mapping that the IOMMU hardware can't handle.
> Even if we could handle it, it's not clear that we want to.  On AMD-Vi
> the IOMMU pages tables can grow to 6-levels deep.  A stray mapping like
> this then causes space and time overhead until the tables are pruned
> back down.  Thanks,
> 
> Alex

In the common case of a single VFIO device per IOMMU, you really should
not add the device's own BARs to the IOMMU mappings. That's not a
complete fix, but it addresses the overhead concern that you mention here.
Mike D. Day Jan. 14, 2014, 1:50 p.m. UTC | #23
"Michael S. Tsirkin" <mst@redhat.com> writes:

> On Fri, Jan 10, 2014 at 08:31:36AM -0700, Alex Williamson wrote:

> Short term, just assume 48 bits on x86.
>
> We need to figure out what's the limitation on ppc and arm -
> maybe there's none and it can address full 64 bit range.
>
> Cc some people who might know about these platforms.

The document you need is here: 

http://goo.gl/fJYxdN

"PCI Bus Binding To: IEEE Std 1275-1994"

The short answer is that Power (OpenFirmware-to-PCI) supports both MMIO
and Memory mappings for BARs.

Also, both 32-bit and 64-bit BARs are required to be supported. It is
legal to construct a 64-bit BAR by masking all the high bits to
zero. Presumably it would be OK to mask the 16 high bits to zero as
well, constructing a 48-bit address.
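
As a small arithmetic illustration (my example, not from the binding
document), masking the 16 high bits of a 64-bit BAR value to zero leaves
a 48-bit address:

    /* Sketch: clear the upper 16 bits of a 64-bit address, keeping bits 0-47. */
    #include <inttypes.h>
    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        uint64_t bar    = 0xfffffffffebe0000ULL;    /* the stray value from the trace */
        uint64_t mask48 = (UINT64_C(1) << 48) - 1;  /* low 48 bits set */

        printf("0x%" PRIx64 " -> 0x%" PRIx64 "\n", bar, bar & mask48);  /* -> 0xfffffebe0000 */
        return 0;
    }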

Mike
Michael S. Tsirkin Jan. 14, 2014, 2:05 p.m. UTC | #24
On Tue, Jan 14, 2014 at 08:50:54AM -0500, Mike Day wrote:
> 
> "Michael S. Tsirkin" <mst@redhat.com> writes:
> 
> > On Fri, Jan 10, 2014 at 08:31:36AM -0700, Alex Williamson wrote:
> 
> > Short term, just assume 48 bits on x86.
> >
> > We need to figure out what's the limitation on ppc and arm -
> > maybe there's none and it can address full 64 bit range.
> >
> > Cc some people who might know about these platforms.
> 
> The document you need is here: 
> 
> http://goo.gl/fJYxdN
> 
> "PCI Bus Binding To: IEEE Std 1275-1994"
> 
> The short answer is that Power (OpenFirmware-to-PCI) supports both MMIO
> and Memory mappings for BARs.
> 
> Also, both 32-bit and 64-bit BARs are required to be supported. It is
> legal to construct a 64-bit BAR by masking all the high bits to
> zero. Presumably it would be OK to mask the 16 high bits to zero as
> well, constructing a 48-bit address.
> 
> Mike
> 
> -- 
> Mike Day | "Endurance is a Virtue"

The question was whether addresses such as
0xfffffffffec00000 can be a valid BAR value on these
platforms, and whether such an address is accessible to the CPU and
to other PCI devices.
Mike D. Day Jan. 14, 2014, 3:01 p.m. UTC | #25
On Tue, Jan 14, 2014 at 9:05 AM, Michael S. Tsirkin <mst@redhat.com> wrote:
> On Tue, Jan 14, 2014 at 08:50:54AM -0500, Mike Day wrote:

>>
>> Also, both 32-bit and 64-bit BARs are required to be supported. It is
>> legal to construct a 64-bit BAR by masking all the high bits to
>> zero. Presumably it would be OK to mask the 16 high bits to zero as
>> well, constructing a 48-bit address.

> The question was whether addresses such as
> 0xfffffffffec00000 can be a valid BAR value on these
> platforms, whether it's accessible to the CPU and
> to other PCI devices.

The answer has to be no, at least for Linux. Linux uses the high bit of
the page table address as state to indicate a huge page and uses
48-bit addresses. Each PCI device is different, but right now Power7
supports 16TB of RAM, so I don't think the PCI bridge would necessarily
decode the high 16 bits of the memory address. For two PCI devices to
communicate with each other using 64-bit addresses, they both need to
support 64-bit memory in the same address range, which is possible.
All this info subject to Paul Mackerras or Alexy …

Mike
Alex Williamson Jan. 14, 2014, 3:36 p.m. UTC | #26
On Tue, 2014-01-14 at 12:24 +0200, Avi Kivity wrote:
> On 01/14/2014 12:48 AM, Alex Williamson wrote:
> > On Mon, 2014-01-13 at 22:48 +0100, Alexander Graf wrote:
> >>> Am 13.01.2014 um 22:39 schrieb Alex Williamson <alex.williamson@redhat.com>:
> >>>
> >>>> On Sun, 2014-01-12 at 16:03 +0100, Alexander Graf wrote:
> >>>>> On 12.01.2014, at 08:54, Michael S. Tsirkin <mst@redhat.com> wrote:
> >>>>>
> >>>>>> On Fri, Jan 10, 2014 at 08:31:36AM -0700, Alex Williamson wrote:
> >>>>>>> On Fri, 2014-01-10 at 14:55 +0200, Michael S. Tsirkin wrote:
> >>>>>>>> On Thu, Jan 09, 2014 at 03:42:22PM -0700, Alex Williamson wrote:
> >>>>>>>>> On Thu, 2014-01-09 at 23:56 +0200, Michael S. Tsirkin wrote:
> >>>>>>>>>> On Thu, Jan 09, 2014 at 12:03:26PM -0700, Alex Williamson wrote:
> >>>>>>>>>>> On Thu, 2014-01-09 at 11:47 -0700, Alex Williamson wrote:
> >>>>>>>>>>>> On Thu, 2014-01-09 at 20:00 +0200, Michael S. Tsirkin wrote:
> >>>>>>>>>>>>> On Thu, Jan 09, 2014 at 10:24:47AM -0700, Alex Williamson wrote:
> >>>>>>>>>>>>>> On Wed, 2013-12-11 at 20:30 +0200, Michael S. Tsirkin wrote:
> >>>>>>>>>>>>>> From: Paolo Bonzini <pbonzini@redhat.com>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> As an alternative to commit 818f86b (exec: limit system memory
> >>>>>>>>>>>>>> size, 2013-11-04) let's just make all address spaces 64-bit wide.
> >>>>>>>>>>>>>> This eliminates problems with phys_page_find ignoring bits above
> >>>>>>>>>>>>>> TARGET_PHYS_ADDR_SPACE_BITS and address_space_translate_internal
> >>>>>>>>>>>>>> consequently messing up the computations.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> In Luiz's reported crash, at startup gdb attempts to read from address
> >>>>>>>>>>>>>> 0xffffffffffffffe6 to 0xffffffffffffffff inclusive.  The region it gets
> >>>>>>>>>>>>>> is the newly introduced master abort region, which is as big as the PCI
> >>>>>>>>>>>>>> address space (see pci_bus_init).  Due to a typo that's only 2^63-1,
> >>>>>>>>>>>>>> not 2^64.  But we get it anyway because phys_page_find ignores the upper
> >>>>>>>>>>>>>> bits of the physical address.  In address_space_translate_internal then
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>    diff = int128_sub(section->mr->size, int128_make64(addr));
> >>>>>>>>>>>>>>    *plen = int128_get64(int128_min(diff, int128_make64(*plen)));
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> diff becomes negative, and int128_get64 booms.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> The size of the PCI address space region should be fixed anyway.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Reported-by: Luiz Capitulino <lcapitulino@redhat.com>
> >>>>>>>>>>>>>> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> >>>>>>>>>>>>>> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> >>>>>>>>>>>>>> ---
> >>>>>>>>>>>>>> exec.c | 8 ++------
> >>>>>>>>>>>>>> 1 file changed, 2 insertions(+), 6 deletions(-)
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> diff --git a/exec.c b/exec.c
> >>>>>>>>>>>>>> index 7e5ce93..f907f5f 100644
> >>>>>>>>>>>>>> --- a/exec.c
> >>>>>>>>>>>>>> +++ b/exec.c
> >>>>>>>>>>>>>> @@ -94,7 +94,7 @@ struct PhysPageEntry {
> >>>>>>>>>>>>>> #define PHYS_MAP_NODE_NIL (((uint32_t)~0) >> 6)
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> /* Size of the L2 (and L3, etc) page tables.  */
> >>>>>>>>>>>>>> -#define ADDR_SPACE_BITS TARGET_PHYS_ADDR_SPACE_BITS
> >>>>>>>>>>>>>> +#define ADDR_SPACE_BITS 64
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> #define P_L2_BITS 10
> >>>>>>>>>>>>>> #define P_L2_SIZE (1 << P_L2_BITS)
> >>>>>>>>>>>>>> @@ -1861,11 +1861,7 @@ static void memory_map_init(void)
> >>>>>>>>>>>>>> {
> >>>>>>>>>>>>>>     system_memory = g_malloc(sizeof(*system_memory));
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> -    assert(ADDR_SPACE_BITS <= 64);
> >>>>>>>>>>>>>> -
> >>>>>>>>>>>>>> -    memory_region_init(system_memory, NULL, "system",
> >>>>>>>>>>>>>> -                       ADDR_SPACE_BITS == 64 ?
> >>>>>>>>>>>>>> -                       UINT64_MAX : (0x1ULL << ADDR_SPACE_BITS));
> >>>>>>>>>>>>>> +    memory_region_init(system_memory, NULL, "system", UINT64_MAX);
> >>>>>>>>>>>>>>     address_space_init(&address_space_memory, system_memory, "memory");
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>     system_io = g_malloc(sizeof(*system_io));
> >>>>>>>>>>>>> This seems to have some unexpected consequences around sizing 64bit PCI
> >>>>>>>>>>>>> BARs that I'm not sure how to handle.
> >>>>>>>>>>>> BARs are often disabled during sizing. Maybe you
> >>>>>>>>>>>> don't detect BAR being disabled?
> >>>>>>>>>>> See the trace below, the BARs are not disabled.  QEMU pci-core is doing
> >>>>>>>>>>> the sizing an memory region updates for the BARs, vfio is just a
> >>>>>>>>>>> pass-through here.
> >>>>>>>>>> Sorry, not in the trace below, but yes the sizing seems to be happening
> >>>>>>>>>> while I/O & memory are enabled int he command register.  Thanks,
> >>>>>>>>>>
> >>>>>>>>>> Alex
> >>>>>>>>> OK then from QEMU POV this BAR value is not special at all.
> >>>>>>>> Unfortunately
> >>>>>>>>
> >>>>>>>>>>>>> After this patch I get vfio
> >>>>>>>>>>>>> traces like this:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> vfio: vfio_pci_read_config(0000:01:10.0, @0x10, len=0x4) febe0004
> >>>>>>>>>>>>> (save lower 32bits of BAR)
> >>>>>>>>>>>>> vfio: vfio_pci_write_config(0000:01:10.0, @0x10, 0xffffffff, len=0x4)
> >>>>>>>>>>>>> (write mask to BAR)
> >>>>>>>>>>>>> vfio: region_del febe0000 - febe3fff
> >>>>>>>>>>>>> (memory region gets unmapped)
> >>>>>>>>>>>>> vfio: vfio_pci_read_config(0000:01:10.0, @0x10, len=0x4) ffffc004
> >>>>>>>>>>>>> (read size mask)
> >>>>>>>>>>>>> vfio: vfio_pci_write_config(0000:01:10.0, @0x10, 0xfebe0004, len=0x4)
> >>>>>>>>>>>>> (restore BAR)
> >>>>>>>>>>>>> vfio: region_add febe0000 - febe3fff [0x7fcf3654d000]
> >>>>>>>>>>>>> (memory region re-mapped)
> >>>>>>>>>>>>> vfio: vfio_pci_read_config(0000:01:10.0, @0x14, len=0x4) 0
> >>>>>>>>>>>>> (save upper 32bits of BAR)
> >>>>>>>>>>>>> vfio: vfio_pci_write_config(0000:01:10.0, @0x14, 0xffffffff, len=0x4)
> >>>>>>>>>>>>> (write mask to BAR)
> >>>>>>>>>>>>> vfio: region_del febe0000 - febe3fff
> >>>>>>>>>>>>> (memory region gets unmapped)
> >>>>>>>>>>>>> vfio: region_add fffffffffebe0000 - fffffffffebe3fff [0x7fcf3654d000]
> >>>>>>>>>>>>> (memory region gets re-mapped with new address)
> >>>>>>>>>>>>> qemu-system-x86_64: vfio_dma_map(0x7fcf38861710, 0xfffffffffebe0000, 0x4000, 0x7fcf3654d000) = -14 (Bad address)
> >>>>>>>>>>>>> (iommu barfs because it can only handle 48bit physical addresses)
> >>>>>>>>>>>> Why are you trying to program BAR addresses for dma in the iommu?
> >>>>>>>>>>> Two reasons, first I can't tell the difference between RAM and MMIO.
> >>>>>>>>> Why can't you? Generally memory core let you find out easily.
> >>>>>>>> My MemoryListener is setup for &address_space_memory and I then filter
> >>>>>>>> out anything that's not memory_region_is_ram().  This still gets
> >>>>>>>> through, so how do I easily find out?
> >>>>>>>>
> >>>>>>>>> But in this case it's vfio device itself that is sized so for sure you
> >>>>>>>>> know it's MMIO.
> >>>>>>>> How so?  I have a MemoryListener as described above and pass everything
> >>>>>>>> through to the IOMMU.  I suppose I could look through all the
> >>>>>>>> VFIODevices and check if the MemoryRegion matches, but that seems really
> >>>>>>>> ugly.
> >>>>>>>>
> >>>>>>>>> Maybe you will have same issue if there's another device with a 64 bit
> >>>>>>>>> bar though, like ivshmem?
> >>>>>>>> Perhaps, I suspect I'll see anything that registers their BAR
> >>>>>>>> MemoryRegion from memory_region_init_ram or memory_region_init_ram_ptr.
> >>>>>>> Must be a 64 bit BAR to trigger the issue though.
> >>>>>>>
> >>>>>>>>>>> Second, it enables peer-to-peer DMA between devices, which is something
> >>>>>>>>>>> that we might be able to take advantage of with GPU passthrough.
> >>>>>>>>>>>
> >>>>>>>>>>>>> Prior to this change, there was no re-map with the fffffffffebe0000
> >>>>>>>>>>>>> address, presumably because it was beyond the address space of the PCI
> >>>>>>>>>>>>> window.  This address is clearly not in a PCI MMIO space, so why are we
> >>>>>>>>>>>>> allowing it to be realized in the system address space at this location?
> >>>>>>>>>>>>> Thanks,
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Alex
> >>>>>>>>>>>> Why do you think it is not in PCI MMIO space?
> >>>>>>>>>>>> True, CPU can't access this address but other pci devices can.
> >>>>>>>>>>> What happens on real hardware when an address like this is programmed to
> >>>>>>>>>>> a device?  The CPU doesn't have the physical bits to access it.  I have
> >>>>>>>>>>> serious doubts that another PCI device would be able to access it
> >>>>>>>>>>> either.  Maybe in some limited scenario where the devices are on the
> >>>>>>>>>>> same conventional PCI bus.  In the typical case, PCI addresses are
> >>>>>>>>>>> always limited by some kind of aperture, whether that's explicit in
> >>>>>>>>>>> bridge windows or implicit in hardware design (and perhaps made explicit
> >>>>>>>>>>> in ACPI).  Even if I wanted to filter these out as noise in vfio, how
> >>>>>>>>>>> would I do it in a way that still allows real 64bit MMIO to be
> >>>>>>>>>>> programmed.  PCI has this knowledge, I hope.  VFIO doesn't.  Thanks,
> >>>>>>>>>>>
> >>>>>>>>>>> Alex
> >>>>>>>>> AFAIK PCI doesn't have that knowledge as such. PCI spec is explicit that
> >>>>>>>>> full 64 bit addresses must be allowed and hardware validation
> >>>>>>>>> test suites normally check that it actually does work
> >>>>>>>>> if it happens.
> >>>>>>>> Sure, PCI devices themselves, but the chipset typically has defined
> >>>>>>>> routing, that's more what I'm referring to.  There are generally only
> >>>>>>>> fixed address windows for RAM vs MMIO.
> >>>>>>> The physical chipset? Likely - in the presence of IOMMU.
> >>>>>>> Without that, devices can talk to each other without going
> >>>>>>> through chipset, and bridge spec is very explicit that
> >>>>>>> full 64 bit addressing must be supported.
> >>>>>>>
> >>>>>>> So as long as we don't emulate an IOMMU,
> >>>>>>> guest will normally think it's okay to use any address.
> >>>>>>>
> >>>>>>>>> Yes, if there's a bridge somewhere on the path that bridge's
> >>>>>>>>> windows would protect you, but pci already does this filtering:
> >>>>>>>>> if you see this address in the memory map this means
> >>>>>>>>> your virtual device is on root bus.
> >>>>>>>>>
> >>>>>>>>> So I think it's the other way around: if VFIO requires specific
> >>>>>>>>> address ranges to be assigned to devices, it should give this
> >>>>>>>>> info to qemu and qemu can give this to guest.
> >>>>>>>>> Then anything outside that range can be ignored by VFIO.
> >>>>>>>> Then we get into deficiencies in the IOMMU API and maybe VFIO.  There's
> >>>>>>>> currently no way to find out the address width of the IOMMU.  We've been
> >>>>>>>> getting by because it's safely close enough to the CPU address width to
> >>>>>>>> not be a concern until we start exposing things at the top of the 64bit
> >>>>>>>> address space.  Maybe I can safely ignore anything above
> >>>>>>>> TARGET_PHYS_ADDR_SPACE_BITS for now.  Thanks,
> >>>>>>>>
> >>>>>>>> Alex
> >>>>>>> I think it's not related to target CPU at all - it's a host limitation.
> >>>>>>> So just make up your own constant, maybe depending on host architecture.
> >>>>>>> Long term add an ioctl to query it.
> >>>>>> It's a hardware limitation which I'd imagine has some loose ties to the
> >>>>>> physical address bits of the CPU.
> >>>>>>
> >>>>>>> Also, we can add a fwcfg interface to tell bios that it should avoid
> >>>>>>> placing BARs above some address.
> >>>>>> That doesn't help this case, it's a spurious mapping caused by sizing
> >>>>>> the BARs with them enabled.  We may still want such a thing to feed into
> >>>>>> building ACPI tables though.
> >>>>> Well the point is that if you want BIOS to avoid
> >>>>> specific addresses, you need to tell it what to avoid.
> >>>>> But neither BIOS nor ACPI actually cover the range above
> >>>>> 2^48 ATM so it's not a high priority.
> >>>>>
> >>>>>>> Since it's a vfio limitation I think it should be a vfio API, along the
> >>>>>>> lines of vfio_get_addr_space_bits(void).
> >>>>>>> (Is this true btw? legacy assignment doesn't have this problem?)
> >>>>>> It's an IOMMU hardware limitation, legacy assignment has the same
> >>>>>> problem.  It looks like legacy will abort() in QEMU for the failed
> >>>>>> mapping and I'm planning to tighten vfio to also kill the VM for failed
> >>>>>> mappings.  In the short term, I think I'll ignore any mappings above
> >>>>>> TARGET_PHYS_ADDR_SPACE_BITS,
> >>>>> That seems very wrong. It will still fail on an x86 host if we are
> >>>>> emulating a CPU with full 64 bit addressing. The limitation is on the
> >>>>> host side there's no real reason to tie it to the target.
> >>> I doubt vfio would be the only thing broken in that case.
> >>>
> >>>>>> long term vfio already has an IOMMU info
> >>>>>> ioctl that we could use to return this information, but we'll need to
> >>>>>> figure out how to get it out of the IOMMU driver first.
> >>>>>> Thanks,
> >>>>>>
> >>>>>> Alex
> >>>>> Short term, just assume 48 bits on x86.
> >>> I hate to pick an arbitrary value since we have a very specific mapping
> >>> we're trying to avoid.  Perhaps a better option is to skip anything
> >>> where:
> >>>
> >>>         MemoryRegionSection.offset_within_address_space >
> >>>         ~MemoryRegionSection.offset_within_address_space
> >>>
> >>>>> We need to figure out what's the limitation on ppc and arm -
> >>>>> maybe there's none and it can address full 64 bit range.
> >>>> IIUC on PPC and ARM you always have BAR windows where things can get mapped into. Unlike x86 where the full phyiscal address range can be overlayed by BARs.
> >>>>
> >>>> Or did I misunderstand the question?
> >>> Sounds right, if either BAR mappings outside the window will not be
> >>> realized in the memory space or the IOMMU has a full 64bit address
> >>> space, there's no problem.  Here we have an intermediate step in the BAR
> >>> sizing producing a stray mapping that the IOMMU hardware can't handle.
> >>> Even if we could handle it, it's not clear that we want to.  On AMD-Vi
> >>> the IOMMU pages tables can grow to 6-levels deep.  A stray mapping like
> >>> this then causes space and time overhead until the tables are pruned
> >>> back down.  Thanks,
> >> I thought sizing is hard defined as a set to
> >> -1? Can't we check for that one special case and treat it as "not mapped, but tell the guest the size in config space"?
> > PCI doesn't want to handle this as anything special to differentiate a
> > sizing mask from a valid BAR address.  I agree though, I'd prefer to
> > never see a spurious address like this in my MemoryListener.
> >
> >
> 
> Can't you just ignore regions that cannot be mapped?  Oh, and teach the 
> bios and/or linux to disable memory access while sizing.

Actually I think we need to be more stringent about DMA mapping
failures.  If a chunk of guest RAM fails to map, then we can lose data if
the device attempts to DMA a packet into it.  How do we know which
regions we can ignore and which we can't?  Whether the CPU can access a
region is a pretty good hint as to whether we can ignore it.  Thanks,

Alex
Alex Williamson Jan. 14, 2014, 3:49 p.m. UTC | #27
On Tue, 2014-01-14 at 14:21 +0200, Michael S. Tsirkin wrote:
> On Mon, Jan 13, 2014 at 02:39:04PM -0700, Alex Williamson wrote:
> > On Sun, 2014-01-12 at 16:03 +0100, Alexander Graf wrote:
> > > On 12.01.2014, at 08:54, Michael S. Tsirkin <mst@redhat.com> wrote:
> > > 
> > > > On Fri, Jan 10, 2014 at 08:31:36AM -0700, Alex Williamson wrote:
> > > >> On Fri, 2014-01-10 at 14:55 +0200, Michael S. Tsirkin wrote:
> > > >>> On Thu, Jan 09, 2014 at 03:42:22PM -0700, Alex Williamson wrote:
> > > >>>> On Thu, 2014-01-09 at 23:56 +0200, Michael S. Tsirkin wrote:
> > > >>>>> On Thu, Jan 09, 2014 at 12:03:26PM -0700, Alex Williamson wrote:
> > > >>>>>> On Thu, 2014-01-09 at 11:47 -0700, Alex Williamson wrote:
> > > >>>>>>> On Thu, 2014-01-09 at 20:00 +0200, Michael S. Tsirkin wrote:
> > > >>>>>>>> On Thu, Jan 09, 2014 at 10:24:47AM -0700, Alex Williamson wrote:
> > > >>>>>>>>> On Wed, 2013-12-11 at 20:30 +0200, Michael S. Tsirkin wrote:
> > > >>>>>>>>>> From: Paolo Bonzini <pbonzini@redhat.com>
> > > >>>>>>>>>> 
> > > >>>>>>>>>> As an alternative to commit 818f86b (exec: limit system memory
> > > >>>>>>>>>> size, 2013-11-04) let's just make all address spaces 64-bit wide.
> > > >>>>>>>>>> This eliminates problems with phys_page_find ignoring bits above
> > > >>>>>>>>>> TARGET_PHYS_ADDR_SPACE_BITS and address_space_translate_internal
> > > >>>>>>>>>> consequently messing up the computations.
> > > >>>>>>>>>> 
> > > >>>>>>>>>> In Luiz's reported crash, at startup gdb attempts to read from address
> > > >>>>>>>>>> 0xffffffffffffffe6 to 0xffffffffffffffff inclusive.  The region it gets
> > > >>>>>>>>>> is the newly introduced master abort region, which is as big as the PCI
> > > >>>>>>>>>> address space (see pci_bus_init).  Due to a typo that's only 2^63-1,
> > > >>>>>>>>>> not 2^64.  But we get it anyway because phys_page_find ignores the upper
> > > >>>>>>>>>> bits of the physical address.  In address_space_translate_internal then
> > > >>>>>>>>>> 
> > > >>>>>>>>>>    diff = int128_sub(section->mr->size, int128_make64(addr));
> > > >>>>>>>>>>    *plen = int128_get64(int128_min(diff, int128_make64(*plen)));
> > > >>>>>>>>>> 
> > > >>>>>>>>>> diff becomes negative, and int128_get64 booms.
> > > >>>>>>>>>> 
> > > >>>>>>>>>> The size of the PCI address space region should be fixed anyway.
> > > >>>>>>>>>> 
> > > >>>>>>>>>> Reported-by: Luiz Capitulino <lcapitulino@redhat.com>
> > > >>>>>>>>>> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> > > >>>>>>>>>> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> > > >>>>>>>>>> ---
> > > >>>>>>>>>> exec.c | 8 ++------
> > > >>>>>>>>>> 1 file changed, 2 insertions(+), 6 deletions(-)
> > > >>>>>>>>>> 
> > > >>>>>>>>>> diff --git a/exec.c b/exec.c
> > > >>>>>>>>>> index 7e5ce93..f907f5f 100644
> > > >>>>>>>>>> --- a/exec.c
> > > >>>>>>>>>> +++ b/exec.c
> > > >>>>>>>>>> @@ -94,7 +94,7 @@ struct PhysPageEntry {
> > > >>>>>>>>>> #define PHYS_MAP_NODE_NIL (((uint32_t)~0) >> 6)
> > > >>>>>>>>>> 
> > > >>>>>>>>>> /* Size of the L2 (and L3, etc) page tables.  */
> > > >>>>>>>>>> -#define ADDR_SPACE_BITS TARGET_PHYS_ADDR_SPACE_BITS
> > > >>>>>>>>>> +#define ADDR_SPACE_BITS 64
> > > >>>>>>>>>> 
> > > >>>>>>>>>> #define P_L2_BITS 10
> > > >>>>>>>>>> #define P_L2_SIZE (1 << P_L2_BITS)
> > > >>>>>>>>>> @@ -1861,11 +1861,7 @@ static void memory_map_init(void)
> > > >>>>>>>>>> {
> > > >>>>>>>>>>     system_memory = g_malloc(sizeof(*system_memory));
> > > >>>>>>>>>> 
> > > >>>>>>>>>> -    assert(ADDR_SPACE_BITS <= 64);
> > > >>>>>>>>>> -
> > > >>>>>>>>>> -    memory_region_init(system_memory, NULL, "system",
> > > >>>>>>>>>> -                       ADDR_SPACE_BITS == 64 ?
> > > >>>>>>>>>> -                       UINT64_MAX : (0x1ULL << ADDR_SPACE_BITS));
> > > >>>>>>>>>> +    memory_region_init(system_memory, NULL, "system", UINT64_MAX);
> > > >>>>>>>>>>     address_space_init(&address_space_memory, system_memory, "memory");
> > > >>>>>>>>>> 
> > > >>>>>>>>>>     system_io = g_malloc(sizeof(*system_io));
> > > >>>>>>>>> 
> > > >>>>>>>>> This seems to have some unexpected consequences around sizing 64bit PCI
> > > >>>>>>>>> BARs that I'm not sure how to handle.
> > > >>>>>>>> 
> > > >>>>>>>> BARs are often disabled during sizing. Maybe you
> > > >>>>>>>> don't detect BAR being disabled?
> > > >>>>>>> 
> > > >>>>>>> See the trace below, the BARs are not disabled.  QEMU pci-core is doing
> > > >>>>>>> the sizing an memory region updates for the BARs, vfio is just a
> > > >>>>>>> pass-through here.
> > > >>>>>> 
> > > >>>>>> Sorry, not in the trace below, but yes the sizing seems to be happening
> > > >>>>>> while I/O & memory are enabled int he command register.  Thanks,
> > > >>>>>> 
> > > >>>>>> Alex
> > > >>>>> 
> > > >>>>> OK then from QEMU POV this BAR value is not special at all.
> > > >>>> 
> > > >>>> Unfortunately
> > > >>>> 
> > > >>>>>>>>> After this patch I get vfio
> > > >>>>>>>>> traces like this:
> > > >>>>>>>>> 
> > > >>>>>>>>> vfio: vfio_pci_read_config(0000:01:10.0, @0x10, len=0x4) febe0004
> > > >>>>>>>>> (save lower 32bits of BAR)
> > > >>>>>>>>> vfio: vfio_pci_write_config(0000:01:10.0, @0x10, 0xffffffff, len=0x4)
> > > >>>>>>>>> (write mask to BAR)
> > > >>>>>>>>> vfio: region_del febe0000 - febe3fff
> > > >>>>>>>>> (memory region gets unmapped)
> > > >>>>>>>>> vfio: vfio_pci_read_config(0000:01:10.0, @0x10, len=0x4) ffffc004
> > > >>>>>>>>> (read size mask)
> > > >>>>>>>>> vfio: vfio_pci_write_config(0000:01:10.0, @0x10, 0xfebe0004, len=0x4)
> > > >>>>>>>>> (restore BAR)
> > > >>>>>>>>> vfio: region_add febe0000 - febe3fff [0x7fcf3654d000]
> > > >>>>>>>>> (memory region re-mapped)
> > > >>>>>>>>> vfio: vfio_pci_read_config(0000:01:10.0, @0x14, len=0x4) 0
> > > >>>>>>>>> (save upper 32bits of BAR)
> > > >>>>>>>>> vfio: vfio_pci_write_config(0000:01:10.0, @0x14, 0xffffffff, len=0x4)
> > > >>>>>>>>> (write mask to BAR)
> > > >>>>>>>>> vfio: region_del febe0000 - febe3fff
> > > >>>>>>>>> (memory region gets unmapped)
> > > >>>>>>>>> vfio: region_add fffffffffebe0000 - fffffffffebe3fff [0x7fcf3654d000]
> > > >>>>>>>>> (memory region gets re-mapped with new address)
> > > >>>>>>>>> qemu-system-x86_64: vfio_dma_map(0x7fcf38861710, 0xfffffffffebe0000, 0x4000, 0x7fcf3654d000) = -14 (Bad address)
> > > >>>>>>>>> (iommu barfs because it can only handle 48bit physical addresses)
> > > >>>>>>>>> 
> > > >>>>>>>> 
> > > >>>>>>>> Why are you trying to program BAR addresses for dma in the iommu?
> > > >>>>>>> 
> > > >>>>>>> Two reasons, first I can't tell the difference between RAM and MMIO.
> > > >>>>> 
> > > >>>>> Why can't you? Generally memory core let you find out easily.
> > > >>>> 
> > > >>>> My MemoryListener is setup for &address_space_memory and I then filter
> > > >>>> out anything that's not memory_region_is_ram().  This still gets
> > > >>>> through, so how do I easily find out?
> > > >>>> 
> > > >>>>> But in this case it's vfio device itself that is sized so for sure you
> > > >>>>> know it's MMIO.
> > > >>>> 
> > > >>>> How so?  I have a MemoryListener as described above and pass everything
> > > >>>> through to the IOMMU.  I suppose I could look through all the
> > > >>>> VFIODevices and check if the MemoryRegion matches, but that seems really
> > > >>>> ugly.
> > > >>>> 
> > > >>>>> Maybe you will have same issue if there's another device with a 64 bit
> > > >>>>> bar though, like ivshmem?
> > > >>>> 
> > > >>>> Perhaps, I suspect I'll see anything that registers their BAR
> > > >>>> MemoryRegion from memory_region_init_ram or memory_region_init_ram_ptr.
> > > >>> 
> > > >>> Must be a 64 bit BAR to trigger the issue though.
> > > >>> 
> > > >>>>>>> Second, it enables peer-to-peer DMA between devices, which is something
> > > >>>>>>> that we might be able to take advantage of with GPU passthrough.
> > > >>>>>>> 
> > > >>>>>>>>> Prior to this change, there was no re-map with the fffffffffebe0000
> > > >>>>>>>>> address, presumably because it was beyond the address space of the PCI
> > > >>>>>>>>> window.  This address is clearly not in a PCI MMIO space, so why are we
> > > >>>>>>>>> allowing it to be realized in the system address space at this location?
> > > >>>>>>>>> Thanks,
> > > >>>>>>>>> 
> > > >>>>>>>>> Alex
> > > >>>>>>>> 
> > > >>>>>>>> Why do you think it is not in PCI MMIO space?
> > > >>>>>>>> True, CPU can't access this address but other pci devices can.
> > > >>>>>>> 
> > > >>>>>>> What happens on real hardware when an address like this is programmed to
> > > >>>>>>> a device?  The CPU doesn't have the physical bits to access it.  I have
> > > >>>>>>> serious doubts that another PCI device would be able to access it
> > > >>>>>>> either.  Maybe in some limited scenario where the devices are on the
> > > >>>>>>> same conventional PCI bus.  In the typical case, PCI addresses are
> > > >>>>>>> always limited by some kind of aperture, whether that's explicit in
> > > >>>>>>> bridge windows or implicit in hardware design (and perhaps made explicit
> > > >>>>>>> in ACPI).  Even if I wanted to filter these out as noise in vfio, how
> > > >>>>>>> would I do it in a way that still allows real 64bit MMIO to be
> > > >>>>>>> programmed.  PCI has this knowledge, I hope.  VFIO doesn't.  Thanks,
> > > >>>>>>> 
> > > >>>>>>> Alex
> > > >>>>>> 
> > > >>>>> 
> > > >>>>> AFAIK PCI doesn't have that knowledge as such. PCI spec is explicit that
> > > >>>>> full 64 bit addresses must be allowed and hardware validation
> > > >>>>> test suites normally check that it actually does work
> > > >>>>> if it happens.
> > > >>>> 
> > > >>>> Sure, PCI devices themselves, but the chipset typically has defined
> > > >>>> routing, that's more what I'm referring to.  There are generally only
> > > >>>> fixed address windows for RAM vs MMIO.
> > > >>> 
> > > >>> The physical chipset? Likely - in the presence of IOMMU.
> > > >>> Without that, devices can talk to each other without going
> > > >>> through chipset, and bridge spec is very explicit that
> > > >>> full 64 bit addressing must be supported.
> > > >>> 
> > > >>> So as long as we don't emulate an IOMMU,
> > > >>> guest will normally think it's okay to use any address.
> > > >>> 
> > > >>>>> Yes, if there's a bridge somewhere on the path that bridge's
> > > >>>>> windows would protect you, but pci already does this filtering:
> > > >>>>> if you see this address in the memory map this means
> > > >>>>> your virtual device is on root bus.
> > > >>>>> 
> > > >>>>> So I think it's the other way around: if VFIO requires specific
> > > >>>>> address ranges to be assigned to devices, it should give this
> > > >>>>> info to qemu and qemu can give this to guest.
> > > >>>>> Then anything outside that range can be ignored by VFIO.
> > > >>>> 
> > > >>>> Then we get into deficiencies in the IOMMU API and maybe VFIO.  There's
> > > >>>> currently no way to find out the address width of the IOMMU.  We've been
> > > >>>> getting by because it's safely close enough to the CPU address width to
> > > >>>> not be a concern until we start exposing things at the top of the 64bit
> > > >>>> address space.  Maybe I can safely ignore anything above
> > > >>>> TARGET_PHYS_ADDR_SPACE_BITS for now.  Thanks,
> > > >>>> 
> > > >>>> Alex
> > > >>> 
> > > >>> I think it's not related to target CPU at all - it's a host limitation.
> > > >>> So just make up your own constant, maybe depending on host architecture.
> > > >>> Long term add an ioctl to query it.
> > > >> 
> > > >> It's a hardware limitation which I'd imagine has some loose ties to the
> > > >> physical address bits of the CPU.
> > > >> 
> > > >>> Also, we can add a fwcfg interface to tell bios that it should avoid
> > > >>> placing BARs above some address.
> > > >> 
> > > >> That doesn't help this case, it's a spurious mapping caused by sizing
> > > >> the BARs with them enabled.  We may still want such a thing to feed into
> > > >> building ACPI tables though.
> > > > 
> > > > Well the point is that if you want BIOS to avoid
> > > > specific addresses, you need to tell it what to avoid.
> > > > But neither BIOS nor ACPI actually cover the range above
> > > > 2^48 ATM so it's not a high priority.
> > > > 
> > > >>> Since it's a vfio limitation I think it should be a vfio API, along the
> > > >>> lines of vfio_get_addr_space_bits(void).
> > > >>> (Is this true btw? legacy assignment doesn't have this problem?)
> > > >> 
> > > >> It's an IOMMU hardware limitation, legacy assignment has the same
> > > >> problem.  It looks like legacy will abort() in QEMU for the failed
> > > >> mapping and I'm planning to tighten vfio to also kill the VM for failed
> > > >> mappings.  In the short term, I think I'll ignore any mappings above
> > > >> TARGET_PHYS_ADDR_SPACE_BITS,
> > > > 
> > > > That seems very wrong. It will still fail on an x86 host if we are
> > > > emulating a CPU with full 64 bit addressing. The limitation is on the
> > > > host side there's no real reason to tie it to the target.
> > 
> > I doubt vfio would be the only thing broken in that case.
> 
> A bit cryptic.
> target-s390x/cpu.h:#define TARGET_PHYS_ADDR_SPACE_BITS 64
> So qemu does emulate at least one full-64 bit CPU.
> 
> It's possible that something limits PCI BAR address
> there, it might or might not be architectural.
> 
> > > >> long term vfio already has an IOMMU info
> > > >> ioctl that we could use to return this information, but we'll need to
> > > >> figure out how to get it out of the IOMMU driver first.
> > > >> Thanks,
> > > >> 
> > > >> Alex
> > > > 
> > > > Short term, just assume 48 bits on x86.
> > 
> > I hate to pick an arbitrary value since we have a very specific mapping
> > we're trying to avoid.
> 
> Well it's not a specific mapping really.
> 
> Any mapping outside host IOMMU would not work.
> guests happen to trigger it while sizing but again
> they are allowed to write anything into BARs really.
> 
> >  Perhaps a better option is to skip anything
> > where:
> > 
> >         MemoryRegionSection.offset_within_address_space >
> >         ~MemoryRegionSection.offset_within_address_space
> 
> 
> This merely checks that high bit is 1, doesn't it?
> 
> So this equivalently assumes 63 bits on x86, if you prefer
> 63 and not 48, that's fine with me.
> 
> 
> 
> 
> > > > We need to figure out what's the limitation on ppc and arm -
> > > > maybe there's none and it can address full 64 bit range.
> > > 
> > > IIUC on PPC and ARM you always have BAR windows where things can get mapped into. Unlike x86 where the full phyiscal address range can be overlayed by BARs.
> > > 
> > > Or did I misunderstand the question?
> > 
> > Sounds right, if either BAR mappings outside the window will not be
> > realized in the memory space or the IOMMU has a full 64bit address
> > space, there's no problem.  Here we have an intermediate step in the BAR
> > sizing producing a stray mapping that the IOMMU hardware can't handle.
> > Even if we could handle it, it's not clear that we want to.  On AMD-Vi
> > the IOMMU pages tables can grow to 6-levels deep.  A stray mapping like
> > this then causes space and time overhead until the tables are pruned
> > back down.  Thanks,
> > 
> > Alex
> 
> In the common case of a single VFIO device per IOMMU, you really should not
> add its own BARs in the IOMMU. That's not a complete fix
> but it addresses the overhead concern that you mention here.

That seems like a big assumption.  We now have support for assigned GPUs
which can be paired to do SLI.  One way they might do SLI is via
peer-to-peer DMA.  We can enable that by mapping device BARs through the
IOMMU.  So it seems quite valid to want to map these.

If we choose not to map them, how do we distinguish them from guest RAM?
There's no MemoryRegion flag that I'm aware of to distinguish a ram_ptr
that points to a chunk of guest memory from one that points to the mmap
of a device BAR.  I think I'd need to explicitly walk all of the vfio
devices and try to match the MemoryRegion pointer to one of my devices.
That only solves the problem for vfio devices and not ivshmem devices or
pci-assign devices.
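
The filtering in question has roughly the following shape (a sketch against
the memory listener API of this period, not the actual vfio listener code).
The problem described above is that a BAR registered via
memory_region_init_ram_ptr() still passes the memory_region_is_ram() test:

    #include "exec/memory.h"

    static void ram_only_region_add(MemoryListener *listener,
                                    MemoryRegionSection *section)
    {
        hwaddr iova = section->offset_within_address_space;
        uint64_t size = int128_get64(section->size);

        /* Drops I/O-emulated regions, but a BAR registered through
         * memory_region_init_ram_ptr() still counts as RAM here, so the
         * mmap'd device BAR gets through just like guest memory does. */
        if (!memory_region_is_ram(section->mr)) {
            return;
        }

        /* ... map (iova, size, host pointer) through the IOMMU ... */
    }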

Another idea I was trying to implement is that we can enable the mmap
MemoryRegion lazily on first access.  That means we would ignore these
spurious mappings because they never get accessed.  Two problems
though: first, how/where to disable the mmap MemoryRegion (modifying the
memory map from within MemoryListener.region_del seems to do bad
things); second, we can't handle the case of a BAR only being accessed
via peer-to-peer (which seems unlikely).  Perhaps the nail in the coffin
again is that it only solves the problem for vfio devices; spurious
mappings from other devices backed by a ram_ptr will still fault.  Thanks,

Alex
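
The "offset_within_address_space > ~offset_within_address_space" filter
discussed above is indeed just a test of bit 63.  A standalone illustration,
with the addresses taken from the trace; the 48-bit case is only there to
show what the test does not catch:

    #include <stdint.h>
    #include <stdio.h>

    /* Skip a section when offset > ~offset: for a uint64_t this is true
     * exactly when bit 63 is set, i.e. it only assumes 63 address bits. */
    static int should_skip(uint64_t offset)
    {
        return offset > ~offset;          /* same as (offset >> 63) != 0 */
    }

    int main(void)
    {
        printf("%d\n", should_skip(0xfebe0000ULL));         /* 0: normal BAR   */
        printf("%d\n", should_skip(0xfffffffffebe0000ULL)); /* 1: stray sizing */
        printf("%d\n", should_skip(1ULL << 48));            /* 0: >48 bits not caught */
        return 0;
    }
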
Alex Williamson Jan. 14, 2014, 3:57 p.m. UTC | #28
On Tue, 2014-01-14 at 14:07 +0200, Michael S. Tsirkin wrote:
> On Mon, Jan 13, 2014 at 03:48:11PM -0700, Alex Williamson wrote:
> > On Mon, 2014-01-13 at 22:48 +0100, Alexander Graf wrote:
> > > 
> > > > Am 13.01.2014 um 22:39 schrieb Alex Williamson <alex.williamson@redhat.com>:
> > > > 
> > > >> On Sun, 2014-01-12 at 16:03 +0100, Alexander Graf wrote:
> > > >>> On 12.01.2014, at 08:54, Michael S. Tsirkin <mst@redhat.com> wrote:
> > > >>> 
> > > >>>> On Fri, Jan 10, 2014 at 08:31:36AM -0700, Alex Williamson wrote:
> > > >>>>> On Fri, 2014-01-10 at 14:55 +0200, Michael S. Tsirkin wrote:
> > > >>>>>> On Thu, Jan 09, 2014 at 03:42:22PM -0700, Alex Williamson wrote:
> > > >>>>>>> On Thu, 2014-01-09 at 23:56 +0200, Michael S. Tsirkin wrote:
> > > >>>>>>>> On Thu, Jan 09, 2014 at 12:03:26PM -0700, Alex Williamson wrote:
> > > >>>>>>>>> On Thu, 2014-01-09 at 11:47 -0700, Alex Williamson wrote:
> > > >>>>>>>>>> On Thu, 2014-01-09 at 20:00 +0200, Michael S. Tsirkin wrote:
> > > >>>>>>>>>>> On Thu, Jan 09, 2014 at 10:24:47AM -0700, Alex Williamson wrote:
> > > >>>>>>>>>>>> On Wed, 2013-12-11 at 20:30 +0200, Michael S. Tsirkin wrote:
> > > >>>>>>>>>>>> From: Paolo Bonzini <pbonzini@redhat.com>
> > > >>>>>>>>>>>> 
> > > >>>>>>>>>>>> As an alternative to commit 818f86b (exec: limit system memory
> > > >>>>>>>>>>>> size, 2013-11-04) let's just make all address spaces 64-bit wide.
> > > >>>>>>>>>>>> This eliminates problems with phys_page_find ignoring bits above
> > > >>>>>>>>>>>> TARGET_PHYS_ADDR_SPACE_BITS and address_space_translate_internal
> > > >>>>>>>>>>>> consequently messing up the computations.
> > > >>>>>>>>>>>> 
> > > >>>>>>>>>>>> In Luiz's reported crash, at startup gdb attempts to read from address
> > > >>>>>>>>>>>> 0xffffffffffffffe6 to 0xffffffffffffffff inclusive.  The region it gets
> > > >>>>>>>>>>>> is the newly introduced master abort region, which is as big as the PCI
> > > >>>>>>>>>>>> address space (see pci_bus_init).  Due to a typo that's only 2^63-1,
> > > >>>>>>>>>>>> not 2^64.  But we get it anyway because phys_page_find ignores the upper
> > > >>>>>>>>>>>> bits of the physical address.  In address_space_translate_internal then
> > > >>>>>>>>>>>> 
> > > >>>>>>>>>>>>   diff = int128_sub(section->mr->size, int128_make64(addr));
> > > >>>>>>>>>>>>   *plen = int128_get64(int128_min(diff, int128_make64(*plen)));
> > > >>>>>>>>>>>> 
> > > >>>>>>>>>>>> diff becomes negative, and int128_get64 booms.
> > > >>>>>>>>>>>> 
> > > >>>>>>>>>>>> The size of the PCI address space region should be fixed anyway.
> > > >>>>>>>>>>>> 
> > > >>>>>>>>>>>> Reported-by: Luiz Capitulino <lcapitulino@redhat.com>
> > > >>>>>>>>>>>> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> > > >>>>>>>>>>>> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> > > >>>>>>>>>>>> ---
> > > >>>>>>>>>>>> exec.c | 8 ++------
> > > >>>>>>>>>>>> 1 file changed, 2 insertions(+), 6 deletions(-)
> > > >>>>>>>>>>>> 
> > > >>>>>>>>>>>> diff --git a/exec.c b/exec.c
> > > >>>>>>>>>>>> index 7e5ce93..f907f5f 100644
> > > >>>>>>>>>>>> --- a/exec.c
> > > >>>>>>>>>>>> +++ b/exec.c
> > > >>>>>>>>>>>> @@ -94,7 +94,7 @@ struct PhysPageEntry {
> > > >>>>>>>>>>>> #define PHYS_MAP_NODE_NIL (((uint32_t)~0) >> 6)
> > > >>>>>>>>>>>> 
> > > >>>>>>>>>>>> /* Size of the L2 (and L3, etc) page tables.  */
> > > >>>>>>>>>>>> -#define ADDR_SPACE_BITS TARGET_PHYS_ADDR_SPACE_BITS
> > > >>>>>>>>>>>> +#define ADDR_SPACE_BITS 64
> > > >>>>>>>>>>>> 
> > > >>>>>>>>>>>> #define P_L2_BITS 10
> > > >>>>>>>>>>>> #define P_L2_SIZE (1 << P_L2_BITS)
> > > >>>>>>>>>>>> @@ -1861,11 +1861,7 @@ static void memory_map_init(void)
> > > >>>>>>>>>>>> {
> > > >>>>>>>>>>>>    system_memory = g_malloc(sizeof(*system_memory));
> > > >>>>>>>>>>>> 
> > > >>>>>>>>>>>> -    assert(ADDR_SPACE_BITS <= 64);
> > > >>>>>>>>>>>> -
> > > >>>>>>>>>>>> -    memory_region_init(system_memory, NULL, "system",
> > > >>>>>>>>>>>> -                       ADDR_SPACE_BITS == 64 ?
> > > >>>>>>>>>>>> -                       UINT64_MAX : (0x1ULL << ADDR_SPACE_BITS));
> > > >>>>>>>>>>>> +    memory_region_init(system_memory, NULL, "system", UINT64_MAX);
> > > >>>>>>>>>>>>    address_space_init(&address_space_memory, system_memory, "memory");
> > > >>>>>>>>>>>> 
> > > >>>>>>>>>>>>    system_io = g_malloc(sizeof(*system_io));
> > > >>>>>>>>>>> 
> > > >>>>>>>>>>> This seems to have some unexpected consequences around sizing 64bit PCI
> > > >>>>>>>>>>> BARs that I'm not sure how to handle.
> > > >>>>>>>>>> 
> > > >>>>>>>>>> BARs are often disabled during sizing. Maybe you
> > > >>>>>>>>>> don't detect BAR being disabled?
> > > >>>>>>>>> 
> > > >>>>>>>>> See the trace below, the BARs are not disabled.  QEMU pci-core is doing
> > > >>>>>>>>> the sizing an memory region updates for the BARs, vfio is just a
> > > >>>>>>>>> pass-through here.
> > > >>>>>>>> 
> > > >>>>>>>> Sorry, not in the trace below, but yes the sizing seems to be happening
> > > >>>>>>>> while I/O & memory are enabled int he command register.  Thanks,
> > > >>>>>>>> 
> > > >>>>>>>> Alex
> > > >>>>>>> 
> > > >>>>>>> OK then from QEMU POV this BAR value is not special at all.
> > > >>>>>> 
> > > >>>>>> Unfortunately
> > > >>>>>> 
> > > >>>>>>>>>>> After this patch I get vfio
> > > >>>>>>>>>>> traces like this:
> > > >>>>>>>>>>> 
> > > >>>>>>>>>>> vfio: vfio_pci_read_config(0000:01:10.0, @0x10, len=0x4) febe0004
> > > >>>>>>>>>>> (save lower 32bits of BAR)
> > > >>>>>>>>>>> vfio: vfio_pci_write_config(0000:01:10.0, @0x10, 0xffffffff, len=0x4)
> > > >>>>>>>>>>> (write mask to BAR)
> > > >>>>>>>>>>> vfio: region_del febe0000 - febe3fff
> > > >>>>>>>>>>> (memory region gets unmapped)
> > > >>>>>>>>>>> vfio: vfio_pci_read_config(0000:01:10.0, @0x10, len=0x4) ffffc004
> > > >>>>>>>>>>> (read size mask)
> > > >>>>>>>>>>> vfio: vfio_pci_write_config(0000:01:10.0, @0x10, 0xfebe0004, len=0x4)
> > > >>>>>>>>>>> (restore BAR)
> > > >>>>>>>>>>> vfio: region_add febe0000 - febe3fff [0x7fcf3654d000]
> > > >>>>>>>>>>> (memory region re-mapped)
> > > >>>>>>>>>>> vfio: vfio_pci_read_config(0000:01:10.0, @0x14, len=0x4) 0
> > > >>>>>>>>>>> (save upper 32bits of BAR)
> > > >>>>>>>>>>> vfio: vfio_pci_write_config(0000:01:10.0, @0x14, 0xffffffff, len=0x4)
> > > >>>>>>>>>>> (write mask to BAR)
> > > >>>>>>>>>>> vfio: region_del febe0000 - febe3fff
> > > >>>>>>>>>>> (memory region gets unmapped)
> > > >>>>>>>>>>> vfio: region_add fffffffffebe0000 - fffffffffebe3fff [0x7fcf3654d000]
> > > >>>>>>>>>>> (memory region gets re-mapped with new address)
> > > >>>>>>>>>>> qemu-system-x86_64: vfio_dma_map(0x7fcf38861710, 0xfffffffffebe0000, 0x4000, 0x7fcf3654d000) = -14 (Bad address)
> > > >>>>>>>>>>> (iommu barfs because it can only handle 48bit physical addresses)
> > > >>>>>>>>>> 
> > > >>>>>>>>>> Why are you trying to program BAR addresses for dma in the iommu?
> > > >>>>>>>>> 
> > > >>>>>>>>> Two reasons, first I can't tell the difference between RAM and MMIO.
> > > >>>>>>> 
> > > >>>>>>> Why can't you? Generally memory core let you find out easily.
> > > >>>>>> 
> > > >>>>>> My MemoryListener is setup for &address_space_memory and I then filter
> > > >>>>>> out anything that's not memory_region_is_ram().  This still gets
> > > >>>>>> through, so how do I easily find out?
> > > >>>>>> 
> > > >>>>>>> But in this case it's vfio device itself that is sized so for sure you
> > > >>>>>>> know it's MMIO.
> > > >>>>>> 
> > > >>>>>> How so?  I have a MemoryListener as described above and pass everything
> > > >>>>>> through to the IOMMU.  I suppose I could look through all the
> > > >>>>>> VFIODevices and check if the MemoryRegion matches, but that seems really
> > > >>>>>> ugly.
> > > >>>>>> 
> > > >>>>>>> Maybe you will have same issue if there's another device with a 64 bit
> > > >>>>>>> bar though, like ivshmem?
> > > >>>>>> 
> > > >>>>>> Perhaps, I suspect I'll see anything that registers their BAR
> > > >>>>>> MemoryRegion from memory_region_init_ram or memory_region_init_ram_ptr.
> > > >>>>> 
> > > >>>>> Must be a 64 bit BAR to trigger the issue though.
> > > >>>>> 
> > > >>>>>>>>> Second, it enables peer-to-peer DMA between devices, which is something
> > > >>>>>>>>> that we might be able to take advantage of with GPU passthrough.
> > > >>>>>>>>> 
> > > >>>>>>>>>>> Prior to this change, there was no re-map with the fffffffffebe0000
> > > >>>>>>>>>>> address, presumably because it was beyond the address space of the PCI
> > > >>>>>>>>>>> window.  This address is clearly not in a PCI MMIO space, so why are we
> > > >>>>>>>>>>> allowing it to be realized in the system address space at this location?
> > > >>>>>>>>>>> Thanks,
> > > >>>>>>>>>>> 
> > > >>>>>>>>>>> Alex
> > > >>>>>>>>>> 
> > > >>>>>>>>>> Why do you think it is not in PCI MMIO space?
> > > >>>>>>>>>> True, CPU can't access this address but other pci devices can.
> > > >>>>>>>>> 
> > > >>>>>>>>> What happens on real hardware when an address like this is programmed to
> > > >>>>>>>>> a device?  The CPU doesn't have the physical bits to access it.  I have
> > > >>>>>>>>> serious doubts that another PCI device would be able to access it
> > > >>>>>>>>> either.  Maybe in some limited scenario where the devices are on the
> > > >>>>>>>>> same conventional PCI bus.  In the typical case, PCI addresses are
> > > >>>>>>>>> always limited by some kind of aperture, whether that's explicit in
> > > >>>>>>>>> bridge windows or implicit in hardware design (and perhaps made explicit
> > > >>>>>>>>> in ACPI).  Even if I wanted to filter these out as noise in vfio, how
> > > >>>>>>>>> would I do it in a way that still allows real 64bit MMIO to be
> > > >>>>>>>>> programmed.  PCI has this knowledge, I hope.  VFIO doesn't.  Thanks,
> > > >>>>>>>>> 
> > > >>>>>>>>> Alex
> > > >>>>>>> 
> > > >>>>>>> AFAIK PCI doesn't have that knowledge as such. PCI spec is explicit that
> > > >>>>>>> full 64 bit addresses must be allowed and hardware validation
> > > >>>>>>> test suites normally check that it actually does work
> > > >>>>>>> if it happens.
> > > >>>>>> 
> > > >>>>>> Sure, PCI devices themselves, but the chipset typically has defined
> > > >>>>>> routing, that's more what I'm referring to.  There are generally only
> > > >>>>>> fixed address windows for RAM vs MMIO.
> > > >>>>> 
> > > >>>>> The physical chipset? Likely - in the presence of IOMMU.
> > > >>>>> Without that, devices can talk to each other without going
> > > >>>>> through chipset, and bridge spec is very explicit that
> > > >>>>> full 64 bit addressing must be supported.
> > > >>>>> 
> > > >>>>> So as long as we don't emulate an IOMMU,
> > > >>>>> guest will normally think it's okay to use any address.
> > > >>>>> 
> > > >>>>>>> Yes, if there's a bridge somewhere on the path that bridge's
> > > >>>>>>> windows would protect you, but pci already does this filtering:
> > > >>>>>>> if you see this address in the memory map this means
> > > >>>>>>> your virtual device is on root bus.
> > > >>>>>>> 
> > > >>>>>>> So I think it's the other way around: if VFIO requires specific
> > > >>>>>>> address ranges to be assigned to devices, it should give this
> > > >>>>>>> info to qemu and qemu can give this to guest.
> > > >>>>>>> Then anything outside that range can be ignored by VFIO.
> > > >>>>>> 
> > > >>>>>> Then we get into deficiencies in the IOMMU API and maybe VFIO.  There's
> > > >>>>>> currently no way to find out the address width of the IOMMU.  We've been
> > > >>>>>> getting by because it's safely close enough to the CPU address width to
> > > >>>>>> not be a concern until we start exposing things at the top of the 64bit
> > > >>>>>> address space.  Maybe I can safely ignore anything above
> > > >>>>>> TARGET_PHYS_ADDR_SPACE_BITS for now.  Thanks,
> > > >>>>>> 
> > > >>>>>> Alex
> > > >>>>> 
> > > >>>>> I think it's not related to target CPU at all - it's a host limitation.
> > > >>>>> So just make up your own constant, maybe depending on host architecture.
> > > >>>>> Long term add an ioctl to query it.
> > > >>>> 
> > > >>>> It's a hardware limitation which I'd imagine has some loose ties to the
> > > >>>> physical address bits of the CPU.
> > > >>>> 
> > > >>>>> Also, we can add a fwcfg interface to tell bios that it should avoid
> > > >>>>> placing BARs above some address.
> > > >>>> 
> > > >>>> That doesn't help this case, it's a spurious mapping caused by sizing
> > > >>>> the BARs with them enabled.  We may still want such a thing to feed into
> > > >>>> building ACPI tables though.
> > > >>> 
> > > >>> Well the point is that if you want BIOS to avoid
> > > >>> specific addresses, you need to tell it what to avoid.
> > > >>> But neither BIOS nor ACPI actually cover the range above
> > > >>> 2^48 ATM so it's not a high priority.
> > > >>> 
> > > >>>>> Since it's a vfio limitation I think it should be a vfio API, along the
> > > >>>>> lines of vfio_get_addr_space_bits(void).
> > > >>>>> (Is this true btw? legacy assignment doesn't have this problem?)
> > > >>>> 
> > > >>>> It's an IOMMU hardware limitation, legacy assignment has the same
> > > >>>> problem.  It looks like legacy will abort() in QEMU for the failed
> > > >>>> mapping and I'm planning to tighten vfio to also kill the VM for failed
> > > >>>> mappings.  In the short term, I think I'll ignore any mappings above
> > > >>>> TARGET_PHYS_ADDR_SPACE_BITS,
> > > >>> 
> > > >>> That seems very wrong. It will still fail on an x86 host if we are
> > > >>> emulating a CPU with full 64 bit addressing. The limitation is on the
> > > >>> host side there's no real reason to tie it to the target.
> > > > 
> > > > I doubt vfio would be the only thing broken in that case.
> > > > 
> > > >>>> long term vfio already has an IOMMU info
> > > >>>> ioctl that we could use to return this information, but we'll need to
> > > >>>> figure out how to get it out of the IOMMU driver first.
> > > >>>> Thanks,
> > > >>>> 
> > > >>>> Alex
> > > >>> 
> > > >>> Short term, just assume 48 bits on x86.
> > > > 
> > > > I hate to pick an arbitrary value since we have a very specific mapping
> > > > we're trying to avoid.  Perhaps a better option is to skip anything
> > > > where:
> > > > 
> > > >        MemoryRegionSection.offset_within_address_space >
> > > >        ~MemoryRegionSection.offset_within_address_space
> > > > 
> > > >>> We need to figure out what's the limitation on ppc and arm -
> > > >>> maybe there's none and it can address full 64 bit range.
> > > >> 
> > > >> IIUC on PPC and ARM you always have BAR windows where things can get mapped into. Unlike x86 where the full phyiscal address range can be overlayed by BARs.
> > > >> 
> > > >> Or did I misunderstand the question?
> > > > 
> > > > Sounds right, if either BAR mappings outside the window will not be
> > > > realized in the memory space or the IOMMU has a full 64bit address
> > > > space, there's no problem.  Here we have an intermediate step in the BAR
> > > > sizing producing a stray mapping that the IOMMU hardware can't handle.
> > > > Even if we could handle it, it's not clear that we want to.  On AMD-Vi
> > > > the IOMMU pages tables can grow to 6-levels deep.  A stray mapping like
> > > > this then causes space and time overhead until the tables are pruned
> > > > back down.  Thanks,
> > > 
> > > I thought sizing is hard defined as a set to
> > > -1? Can't we check for that one special case and treat it as "not mapped, but tell the guest the size in config space"?
> > 
> > PCI doesn't want to handle this as anything special to differentiate a
> > sizing mask from a valid BAR address.  I agree though, I'd prefer to
> > never see a spurious address like this in my MemoryListener.
> 
> It's more a can't than doesn't want to: it's a 64 bit BAR, it's not
> set to all ones atomically.
> 
> Also, while it doesn't address this fully (same issue can happen
> e.g. with ivshmem), do you think we should distinguish these BARs mapped
> from vfio / device assignment in qemu somehow?
> 
> In particular, even when it has sane addresses:
> device really can not DMA into its own BAR, that's a spec violation
> so in theory can do anything including crashing the system.
> I don't know what happens in practice but
> if you are programming IOMMU to forward transactions back to
> device that originated it, you are not doing it any favors.

I might concede that peer-to-peer is more trouble than it's worth if I
had a convenient way to ignore MMIO mappings in my MemoryListener, but I
don't.  Self-DMA is really not the intent of doing the mapping, but
peer-to-peer does have merit.
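
How the stray address arises from the non-atomic 64-bit sizing sequence quoted
above can be seen with a few lines of arithmetic (the values are assumed from
the trace: low dword already restored, all-ones just written to the high dword):

    #include <inttypes.h>
    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        uint32_t bar_lo = 0xfebe0004;   /* low dword already restored            */
        uint32_t bar_hi = 0xffffffff;   /* all-ones written to size the high dword */

        /* Mask off the low flag bits of a memory BAR and combine. */
        uint64_t addr = ((uint64_t)bar_hi << 32) | (bar_lo & ~0xfULL);

        /* Prints fffffffffebe0000, the transient address that the memory
         * core maps and that a 48-bit host IOMMU cannot handle. */
        printf("%" PRIx64 "\n", addr);
        return 0;
    }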

> I also note that if someone tries zero copy transmit out of such an
> address, get user pages will fail.
> I think this means tun zero copy transmit needs to fall-back
> on copy from user on get user pages failure.
> 
> Jason, what's tour thinking on this?
>
Michael S. Tsirkin Jan. 14, 2014, 4:03 p.m. UTC | #29
On Tue, Jan 14, 2014 at 08:57:58AM -0700, Alex Williamson wrote:
> On Tue, 2014-01-14 at 14:07 +0200, Michael S. Tsirkin wrote:
> > On Mon, Jan 13, 2014 at 03:48:11PM -0700, Alex Williamson wrote:
> > > On Mon, 2014-01-13 at 22:48 +0100, Alexander Graf wrote:
> > > > 
> > > > > Am 13.01.2014 um 22:39 schrieb Alex Williamson <alex.williamson@redhat.com>:
> > > > > 
> > > > >> On Sun, 2014-01-12 at 16:03 +0100, Alexander Graf wrote:
> > > > >>> On 12.01.2014, at 08:54, Michael S. Tsirkin <mst@redhat.com> wrote:
> > > > >>> 
> > > > >>>> On Fri, Jan 10, 2014 at 08:31:36AM -0700, Alex Williamson wrote:
> > > > >>>>> On Fri, 2014-01-10 at 14:55 +0200, Michael S. Tsirkin wrote:
> > > > >>>>>> On Thu, Jan 09, 2014 at 03:42:22PM -0700, Alex Williamson wrote:
> > > > >>>>>>> On Thu, 2014-01-09 at 23:56 +0200, Michael S. Tsirkin wrote:
> > > > >>>>>>>> On Thu, Jan 09, 2014 at 12:03:26PM -0700, Alex Williamson wrote:
> > > > >>>>>>>>> On Thu, 2014-01-09 at 11:47 -0700, Alex Williamson wrote:
> > > > >>>>>>>>>> On Thu, 2014-01-09 at 20:00 +0200, Michael S. Tsirkin wrote:
> > > > >>>>>>>>>>> On Thu, Jan 09, 2014 at 10:24:47AM -0700, Alex Williamson wrote:
> > > > >>>>>>>>>>>> On Wed, 2013-12-11 at 20:30 +0200, Michael S. Tsirkin wrote:
> > > > >>>>>>>>>>>> From: Paolo Bonzini <pbonzini@redhat.com>
> > > > >>>>>>>>>>>> 
> > > > >>>>>>>>>>>> As an alternative to commit 818f86b (exec: limit system memory
> > > > >>>>>>>>>>>> size, 2013-11-04) let's just make all address spaces 64-bit wide.
> > > > >>>>>>>>>>>> This eliminates problems with phys_page_find ignoring bits above
> > > > >>>>>>>>>>>> TARGET_PHYS_ADDR_SPACE_BITS and address_space_translate_internal
> > > > >>>>>>>>>>>> consequently messing up the computations.
> > > > >>>>>>>>>>>> 
> > > > >>>>>>>>>>>> In Luiz's reported crash, at startup gdb attempts to read from address
> > > > >>>>>>>>>>>> 0xffffffffffffffe6 to 0xffffffffffffffff inclusive.  The region it gets
> > > > >>>>>>>>>>>> is the newly introduced master abort region, which is as big as the PCI
> > > > >>>>>>>>>>>> address space (see pci_bus_init).  Due to a typo that's only 2^63-1,
> > > > >>>>>>>>>>>> not 2^64.  But we get it anyway because phys_page_find ignores the upper
> > > > >>>>>>>>>>>> bits of the physical address.  In address_space_translate_internal then
> > > > >>>>>>>>>>>> 
> > > > >>>>>>>>>>>>   diff = int128_sub(section->mr->size, int128_make64(addr));
> > > > >>>>>>>>>>>>   *plen = int128_get64(int128_min(diff, int128_make64(*plen)));
> > > > >>>>>>>>>>>> 
> > > > >>>>>>>>>>>> diff becomes negative, and int128_get64 booms.
> > > > >>>>>>>>>>>> 
> > > > >>>>>>>>>>>> The size of the PCI address space region should be fixed anyway.
> > > > >>>>>>>>>>>> 
> > > > >>>>>>>>>>>> Reported-by: Luiz Capitulino <lcapitulino@redhat.com>
> > > > >>>>>>>>>>>> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> > > > >>>>>>>>>>>> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> > > > >>>>>>>>>>>> ---
> > > > >>>>>>>>>>>> exec.c | 8 ++------
> > > > >>>>>>>>>>>> 1 file changed, 2 insertions(+), 6 deletions(-)
> > > > >>>>>>>>>>>> 
> > > > >>>>>>>>>>>> diff --git a/exec.c b/exec.c
> > > > >>>>>>>>>>>> index 7e5ce93..f907f5f 100644
> > > > >>>>>>>>>>>> --- a/exec.c
> > > > >>>>>>>>>>>> +++ b/exec.c
> > > > >>>>>>>>>>>> @@ -94,7 +94,7 @@ struct PhysPageEntry {
> > > > >>>>>>>>>>>> #define PHYS_MAP_NODE_NIL (((uint32_t)~0) >> 6)
> > > > >>>>>>>>>>>> 
> > > > >>>>>>>>>>>> /* Size of the L2 (and L3, etc) page tables.  */
> > > > >>>>>>>>>>>> -#define ADDR_SPACE_BITS TARGET_PHYS_ADDR_SPACE_BITS
> > > > >>>>>>>>>>>> +#define ADDR_SPACE_BITS 64
> > > > >>>>>>>>>>>> 
> > > > >>>>>>>>>>>> #define P_L2_BITS 10
> > > > >>>>>>>>>>>> #define P_L2_SIZE (1 << P_L2_BITS)
> > > > >>>>>>>>>>>> @@ -1861,11 +1861,7 @@ static void memory_map_init(void)
> > > > >>>>>>>>>>>> {
> > > > >>>>>>>>>>>>    system_memory = g_malloc(sizeof(*system_memory));
> > > > >>>>>>>>>>>> 
> > > > >>>>>>>>>>>> -    assert(ADDR_SPACE_BITS <= 64);
> > > > >>>>>>>>>>>> -
> > > > >>>>>>>>>>>> -    memory_region_init(system_memory, NULL, "system",
> > > > >>>>>>>>>>>> -                       ADDR_SPACE_BITS == 64 ?
> > > > >>>>>>>>>>>> -                       UINT64_MAX : (0x1ULL << ADDR_SPACE_BITS));
> > > > >>>>>>>>>>>> +    memory_region_init(system_memory, NULL, "system", UINT64_MAX);
> > > > >>>>>>>>>>>>    address_space_init(&address_space_memory, system_memory, "memory");
> > > > >>>>>>>>>>>> 
> > > > >>>>>>>>>>>>    system_io = g_malloc(sizeof(*system_io));
> > > > >>>>>>>>>>> 
> > > > >>>>>>>>>>> This seems to have some unexpected consequences around sizing 64bit PCI
> > > > >>>>>>>>>>> BARs that I'm not sure how to handle.
> > > > >>>>>>>>>> 
> > > > >>>>>>>>>> BARs are often disabled during sizing. Maybe you
> > > > >>>>>>>>>> don't detect BAR being disabled?
> > > > >>>>>>>>> 
> > > > >>>>>>>>> See the trace below, the BARs are not disabled.  QEMU pci-core is doing
> > > > >>>>>>>>> the sizing an memory region updates for the BARs, vfio is just a
> > > > >>>>>>>>> pass-through here.
> > > > >>>>>>>> 
> > > > >>>>>>>> Sorry, not in the trace below, but yes the sizing seems to be happening
> > > > >>>>>>>> while I/O & memory are enabled int he command register.  Thanks,
> > > > >>>>>>>> 
> > > > >>>>>>>> Alex
> > > > >>>>>>> 
> > > > >>>>>>> OK then from QEMU POV this BAR value is not special at all.
> > > > >>>>>> 
> > > > >>>>>> Unfortunately
> > > > >>>>>> 
> > > > >>>>>>>>>>> After this patch I get vfio
> > > > >>>>>>>>>>> traces like this:
> > > > >>>>>>>>>>> 
> > > > >>>>>>>>>>> vfio: vfio_pci_read_config(0000:01:10.0, @0x10, len=0x4) febe0004
> > > > >>>>>>>>>>> (save lower 32bits of BAR)
> > > > >>>>>>>>>>> vfio: vfio_pci_write_config(0000:01:10.0, @0x10, 0xffffffff, len=0x4)
> > > > >>>>>>>>>>> (write mask to BAR)
> > > > >>>>>>>>>>> vfio: region_del febe0000 - febe3fff
> > > > >>>>>>>>>>> (memory region gets unmapped)
> > > > >>>>>>>>>>> vfio: vfio_pci_read_config(0000:01:10.0, @0x10, len=0x4) ffffc004
> > > > >>>>>>>>>>> (read size mask)
> > > > >>>>>>>>>>> vfio: vfio_pci_write_config(0000:01:10.0, @0x10, 0xfebe0004, len=0x4)
> > > > >>>>>>>>>>> (restore BAR)
> > > > >>>>>>>>>>> vfio: region_add febe0000 - febe3fff [0x7fcf3654d000]
> > > > >>>>>>>>>>> (memory region re-mapped)
> > > > >>>>>>>>>>> vfio: vfio_pci_read_config(0000:01:10.0, @0x14, len=0x4) 0
> > > > >>>>>>>>>>> (save upper 32bits of BAR)
> > > > >>>>>>>>>>> vfio: vfio_pci_write_config(0000:01:10.0, @0x14, 0xffffffff, len=0x4)
> > > > >>>>>>>>>>> (write mask to BAR)
> > > > >>>>>>>>>>> vfio: region_del febe0000 - febe3fff
> > > > >>>>>>>>>>> (memory region gets unmapped)
> > > > >>>>>>>>>>> vfio: region_add fffffffffebe0000 - fffffffffebe3fff [0x7fcf3654d000]
> > > > >>>>>>>>>>> (memory region gets re-mapped with new address)
> > > > >>>>>>>>>>> qemu-system-x86_64: vfio_dma_map(0x7fcf38861710, 0xfffffffffebe0000, 0x4000, 0x7fcf3654d000) = -14 (Bad address)
> > > > >>>>>>>>>>> (iommu barfs because it can only handle 48bit physical addresses)
> > > > >>>>>>>>>> 
> > > > >>>>>>>>>> Why are you trying to program BAR addresses for dma in the iommu?
> > > > >>>>>>>>> 
> > > > >>>>>>>>> Two reasons, first I can't tell the difference between RAM and MMIO.
> > > > >>>>>>> 
> > > > >>>>>>> Why can't you? Generally memory core let you find out easily.
> > > > >>>>>> 
> > > > >>>>>> My MemoryListener is setup for &address_space_memory and I then filter
> > > > >>>>>> out anything that's not memory_region_is_ram().  This still gets
> > > > >>>>>> through, so how do I easily find out?
> > > > >>>>>> 
> > > > >>>>>>> But in this case it's vfio device itself that is sized so for sure you
> > > > >>>>>>> know it's MMIO.
> > > > >>>>>> 
> > > > >>>>>> How so?  I have a MemoryListener as described above and pass everything
> > > > >>>>>> through to the IOMMU.  I suppose I could look through all the
> > > > >>>>>> VFIODevices and check if the MemoryRegion matches, but that seems really
> > > > >>>>>> ugly.
> > > > >>>>>> 
> > > > >>>>>>> Maybe you will have same issue if there's another device with a 64 bit
> > > > >>>>>>> bar though, like ivshmem?
> > > > >>>>>> 
> > > > >>>>>> Perhaps, I suspect I'll see anything that registers their BAR
> > > > >>>>>> MemoryRegion from memory_region_init_ram or memory_region_init_ram_ptr.
> > > > >>>>> 
> > > > >>>>> Must be a 64 bit BAR to trigger the issue though.
> > > > >>>>> 
> > > > >>>>>>>>> Second, it enables peer-to-peer DMA between devices, which is something
> > > > >>>>>>>>> that we might be able to take advantage of with GPU passthrough.
> > > > >>>>>>>>> 
> > > > >>>>>>>>>>> Prior to this change, there was no re-map with the fffffffffebe0000
> > > > >>>>>>>>>>> address, presumably because it was beyond the address space of the PCI
> > > > >>>>>>>>>>> window.  This address is clearly not in a PCI MMIO space, so why are we
> > > > >>>>>>>>>>> allowing it to be realized in the system address space at this location?
> > > > >>>>>>>>>>> Thanks,
> > > > >>>>>>>>>>> 
> > > > >>>>>>>>>>> Alex
> > > > >>>>>>>>>> 
> > > > >>>>>>>>>> Why do you think it is not in PCI MMIO space?
> > > > >>>>>>>>>> True, CPU can't access this address but other pci devices can.
> > > > >>>>>>>>> 
> > > > >>>>>>>>> What happens on real hardware when an address like this is programmed to
> > > > >>>>>>>>> a device?  The CPU doesn't have the physical bits to access it.  I have
> > > > >>>>>>>>> serious doubts that another PCI device would be able to access it
> > > > >>>>>>>>> either.  Maybe in some limited scenario where the devices are on the
> > > > >>>>>>>>> same conventional PCI bus.  In the typical case, PCI addresses are
> > > > >>>>>>>>> always limited by some kind of aperture, whether that's explicit in
> > > > >>>>>>>>> bridge windows or implicit in hardware design (and perhaps made explicit
> > > > >>>>>>>>> in ACPI).  Even if I wanted to filter these out as noise in vfio, how
> > > > >>>>>>>>> would I do it in a way that still allows real 64bit MMIO to be
> > > > >>>>>>>>> programmed.  PCI has this knowledge, I hope.  VFIO doesn't.  Thanks,
> > > > >>>>>>>>> 
> > > > >>>>>>>>> Alex
> > > > >>>>>>> 
> > > > >>>>>>> AFAIK PCI doesn't have that knowledge as such. PCI spec is explicit that
> > > > >>>>>>> full 64 bit addresses must be allowed and hardware validation
> > > > >>>>>>> test suites normally check that it actually does work
> > > > >>>>>>> if it happens.
> > > > >>>>>> 
> > > > >>>>>> Sure, PCI devices themselves, but the chipset typically has defined
> > > > >>>>>> routing, that's more what I'm referring to.  There are generally only
> > > > >>>>>> fixed address windows for RAM vs MMIO.
> > > > >>>>> 
> > > > >>>>> The physical chipset? Likely - in the presence of IOMMU.
> > > > >>>>> Without that, devices can talk to each other without going
> > > > >>>>> through chipset, and bridge spec is very explicit that
> > > > >>>>> full 64 bit addressing must be supported.
> > > > >>>>> 
> > > > >>>>> So as long as we don't emulate an IOMMU,
> > > > >>>>> guest will normally think it's okay to use any address.
> > > > >>>>> 
> > > > >>>>>>> Yes, if there's a bridge somewhere on the path that bridge's
> > > > >>>>>>> windows would protect you, but pci already does this filtering:
> > > > >>>>>>> if you see this address in the memory map this means
> > > > >>>>>>> your virtual device is on root bus.
> > > > >>>>>>> 
> > > > >>>>>>> So I think it's the other way around: if VFIO requires specific
> > > > >>>>>>> address ranges to be assigned to devices, it should give this
> > > > >>>>>>> info to qemu and qemu can give this to guest.
> > > > >>>>>>> Then anything outside that range can be ignored by VFIO.
> > > > >>>>>> 
> > > > >>>>>> Then we get into deficiencies in the IOMMU API and maybe VFIO.  There's
> > > > >>>>>> currently no way to find out the address width of the IOMMU.  We've been
> > > > >>>>>> getting by because it's safely close enough to the CPU address width to
> > > > >>>>>> not be a concern until we start exposing things at the top of the 64bit
> > > > >>>>>> address space.  Maybe I can safely ignore anything above
> > > > >>>>>> TARGET_PHYS_ADDR_SPACE_BITS for now.  Thanks,
> > > > >>>>>> 
> > > > >>>>>> Alex
> > > > >>>>> 
> > > > >>>>> I think it's not related to target CPU at all - it's a host limitation.
> > > > >>>>> So just make up your own constant, maybe depending on host architecture.
> > > > >>>>> Long term add an ioctl to query it.
> > > > >>>> 
> > > > >>>> It's a hardware limitation which I'd imagine has some loose ties to the
> > > > >>>> physical address bits of the CPU.
> > > > >>>> 
> > > > >>>>> Also, we can add a fwcfg interface to tell bios that it should avoid
> > > > >>>>> placing BARs above some address.
> > > > >>>> 
> > > > >>>> That doesn't help this case, it's a spurious mapping caused by sizing
> > > > >>>> the BARs with them enabled.  We may still want such a thing to feed into
> > > > >>>> building ACPI tables though.
> > > > >>> 
> > > > >>> Well the point is that if you want BIOS to avoid
> > > > >>> specific addresses, you need to tell it what to avoid.
> > > > >>> But neither BIOS nor ACPI actually cover the range above
> > > > >>> 2^48 ATM so it's not a high priority.
> > > > >>> 
> > > > >>>>> Since it's a vfio limitation I think it should be a vfio API, along the
> > > > >>>>> lines of vfio_get_addr_space_bits(void).
> > > > >>>>> (Is this true btw? legacy assignment doesn't have this problem?)
> > > > >>>> 
> > > > >>>> It's an IOMMU hardware limitation, legacy assignment has the same
> > > > >>>> problem.  It looks like legacy will abort() in QEMU for the failed
> > > > >>>> mapping and I'm planning to tighten vfio to also kill the VM for failed
> > > > >>>> mappings.  In the short term, I think I'll ignore any mappings above
> > > > >>>> TARGET_PHYS_ADDR_SPACE_BITS,
> > > > >>> 
> > > > >>> That seems very wrong. It will still fail on an x86 host if we are
> > > > >>> emulating a CPU with full 64 bit addressing. The limitation is on the
> > > > >>> host side there's no real reason to tie it to the target.
> > > > > 
> > > > > I doubt vfio would be the only thing broken in that case.
> > > > > 
> > > > >>>> long term vfio already has an IOMMU info
> > > > >>>> ioctl that we could use to return this information, but we'll need to
> > > > >>>> figure out how to get it out of the IOMMU driver first.
> > > > >>>> Thanks,
> > > > >>>> 
> > > > >>>> Alex
> > > > >>> 
> > > > >>> Short term, just assume 48 bits on x86.
> > > > > 
> > > > > I hate to pick an arbitrary value since we have a very specific mapping
> > > > > we're trying to avoid.  Perhaps a better option is to skip anything
> > > > > where:
> > > > > 
> > > > >        MemoryRegionSection.offset_within_address_space >
> > > > >        ~MemoryRegionSection.offset_within_address_space
> > > > > 
> > > > >>> We need to figure out what's the limitation on ppc and arm -
> > > > >>> maybe there's none and it can address full 64 bit range.
> > > > >> 
> > > > >> IIUC on PPC and ARM you always have BAR windows where things can get mapped into. Unlike x86 where the full phyiscal address range can be overlayed by BARs.
> > > > >> 
> > > > >> Or did I misunderstand the question?
> > > > > 
> > > > > Sounds right, if either BAR mappings outside the window will not be
> > > > > realized in the memory space or the IOMMU has a full 64bit address
> > > > > space, there's no problem.  Here we have an intermediate step in the BAR
> > > > > sizing producing a stray mapping that the IOMMU hardware can't handle.
> > > > > Even if we could handle it, it's not clear that we want to.  On AMD-Vi
> > > > > the IOMMU pages tables can grow to 6-levels deep.  A stray mapping like
> > > > > this then causes space and time overhead until the tables are pruned
> > > > > back down.  Thanks,
> > > > 
> > > > I thought sizing is hard defined as a set to
> > > > -1? Can't we check for that one special case and treat it as "not mapped, but tell the guest the size in config space"?
> > > 
> > > PCI doesn't want to handle this as anything special to differentiate a
> > > sizing mask from a valid BAR address.  I agree though, I'd prefer to
> > > never see a spurious address like this in my MemoryListener.
> > 
> > It's more a can't than doesn't want to: it's a 64 bit BAR, it's not
> > set to all ones atomically.
> > 
> > Also, while it doesn't address this fully (same issue can happen
> > e.g. with ivshmem), do you think we should distinguish these BARs mapped
> > from vfio / device assignment in qemu somehow?
> > 
> > In particular, even when it has sane addresses:
> > device really can not DMA into its own BAR, that's a spec violation
> > so in theory can do anything including crashing the system.
> > I don't know what happens in practice but
> > if you are programming IOMMU to forward transactions back to
> > device that originated it, you are not doing it any favors.
> 
> I might concede that peer-to-peer is more trouble than it's worth if I
> had a convenient way to ignore MMIO mappings in my MemoryListener, but I
> don't.

Well, for VFIO devices you are creating these mappings, so we can surely
find a way for you to check that.
Doesn't each section point back at the memory region that created it?
Then you can just check that.
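
A sketch of that check, matching the section's MemoryRegion against the
device's own BAR regions; the VFIODevice/VFIOBAR field names used here are
assumptions for illustration, not the real structure layout:

    static bool section_is_own_bar(VFIODevice *vdev,
                                   MemoryRegionSection *section)
    {
        int i;

        for (i = 0; i < PCI_NUM_REGIONS; i++) {
            /* "bars[i].mem" and "bars[i].mmap_mem" stand in for whatever
             * MemoryRegions the driver created for its own BARs. */
            if (section->mr == &vdev->bars[i].mem ||
                section->mr == &vdev->bars[i].mmap_mem) {
                return true;   /* don't map the device's own BAR back to it */
            }
        }
        return false;
    }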

>  Self-DMA is really not the intent of doing the mapping, but
> peer-to-peer does have merit.
> 
> > I also note that if someone tries zero copy transmit out of such an
> > address, get user pages will fail.
> > I think this means tun zero copy transmit needs to fall-back
> > on copy from user on get user pages failure.
> > 
> > Jason, what's tour thinking on this?
> > 
> 
>
Michael S. Tsirkin Jan. 14, 2014, 4:07 p.m. UTC | #30
On Tue, Jan 14, 2014 at 08:49:39AM -0700, Alex Williamson wrote:
> On Tue, 2014-01-14 at 14:21 +0200, Michael S. Tsirkin wrote:
> > On Mon, Jan 13, 2014 at 02:39:04PM -0700, Alex Williamson wrote:
> > > On Sun, 2014-01-12 at 16:03 +0100, Alexander Graf wrote:
> > > > On 12.01.2014, at 08:54, Michael S. Tsirkin <mst@redhat.com> wrote:
> > > > 
> > > > > On Fri, Jan 10, 2014 at 08:31:36AM -0700, Alex Williamson wrote:
> > > > >> On Fri, 2014-01-10 at 14:55 +0200, Michael S. Tsirkin wrote:
> > > > >>> On Thu, Jan 09, 2014 at 03:42:22PM -0700, Alex Williamson wrote:
> > > > >>>> On Thu, 2014-01-09 at 23:56 +0200, Michael S. Tsirkin wrote:
> > > > >>>>> On Thu, Jan 09, 2014 at 12:03:26PM -0700, Alex Williamson wrote:
> > > > >>>>>> On Thu, 2014-01-09 at 11:47 -0700, Alex Williamson wrote:
> > > > >>>>>>> On Thu, 2014-01-09 at 20:00 +0200, Michael S. Tsirkin wrote:
> > > > >>>>>>>> On Thu, Jan 09, 2014 at 10:24:47AM -0700, Alex Williamson wrote:
> > > > >>>>>>>>> On Wed, 2013-12-11 at 20:30 +0200, Michael S. Tsirkin wrote:
> > > > >>>>>>>>>> From: Paolo Bonzini <pbonzini@redhat.com>
> > > > >>>>>>>>>> 
> > > > >>>>>>>>>> As an alternative to commit 818f86b (exec: limit system memory
> > > > >>>>>>>>>> size, 2013-11-04) let's just make all address spaces 64-bit wide.
> > > > >>>>>>>>>> This eliminates problems with phys_page_find ignoring bits above
> > > > >>>>>>>>>> TARGET_PHYS_ADDR_SPACE_BITS and address_space_translate_internal
> > > > >>>>>>>>>> consequently messing up the computations.
> > > > >>>>>>>>>> 
> > > > >>>>>>>>>> In Luiz's reported crash, at startup gdb attempts to read from address
> > > > >>>>>>>>>> 0xffffffffffffffe6 to 0xffffffffffffffff inclusive.  The region it gets
> > > > >>>>>>>>>> is the newly introduced master abort region, which is as big as the PCI
> > > > >>>>>>>>>> address space (see pci_bus_init).  Due to a typo that's only 2^63-1,
> > > > >>>>>>>>>> not 2^64.  But we get it anyway because phys_page_find ignores the upper
> > > > >>>>>>>>>> bits of the physical address.  In address_space_translate_internal then
> > > > >>>>>>>>>> 
> > > > >>>>>>>>>>    diff = int128_sub(section->mr->size, int128_make64(addr));
> > > > >>>>>>>>>>    *plen = int128_get64(int128_min(diff, int128_make64(*plen)));
> > > > >>>>>>>>>> 
> > > > >>>>>>>>>> diff becomes negative, and int128_get64 booms.
> > > > >>>>>>>>>> 
> > > > >>>>>>>>>> The size of the PCI address space region should be fixed anyway.
> > > > >>>>>>>>>> 
> > > > >>>>>>>>>> Reported-by: Luiz Capitulino <lcapitulino@redhat.com>
> > > > >>>>>>>>>> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> > > > >>>>>>>>>> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> > > > >>>>>>>>>> ---
> > > > >>>>>>>>>> exec.c | 8 ++------
> > > > >>>>>>>>>> 1 file changed, 2 insertions(+), 6 deletions(-)
> > > > >>>>>>>>>> 
> > > > >>>>>>>>>> diff --git a/exec.c b/exec.c
> > > > >>>>>>>>>> index 7e5ce93..f907f5f 100644
> > > > >>>>>>>>>> --- a/exec.c
> > > > >>>>>>>>>> +++ b/exec.c
> > > > >>>>>>>>>> @@ -94,7 +94,7 @@ struct PhysPageEntry {
> > > > >>>>>>>>>> #define PHYS_MAP_NODE_NIL (((uint32_t)~0) >> 6)
> > > > >>>>>>>>>> 
> > > > >>>>>>>>>> /* Size of the L2 (and L3, etc) page tables.  */
> > > > >>>>>>>>>> -#define ADDR_SPACE_BITS TARGET_PHYS_ADDR_SPACE_BITS
> > > > >>>>>>>>>> +#define ADDR_SPACE_BITS 64
> > > > >>>>>>>>>> 
> > > > >>>>>>>>>> #define P_L2_BITS 10
> > > > >>>>>>>>>> #define P_L2_SIZE (1 << P_L2_BITS)
> > > > >>>>>>>>>> @@ -1861,11 +1861,7 @@ static void memory_map_init(void)
> > > > >>>>>>>>>> {
> > > > >>>>>>>>>>     system_memory = g_malloc(sizeof(*system_memory));
> > > > >>>>>>>>>> 
> > > > >>>>>>>>>> -    assert(ADDR_SPACE_BITS <= 64);
> > > > >>>>>>>>>> -
> > > > >>>>>>>>>> -    memory_region_init(system_memory, NULL, "system",
> > > > >>>>>>>>>> -                       ADDR_SPACE_BITS == 64 ?
> > > > >>>>>>>>>> -                       UINT64_MAX : (0x1ULL << ADDR_SPACE_BITS));
> > > > >>>>>>>>>> +    memory_region_init(system_memory, NULL, "system", UINT64_MAX);
> > > > >>>>>>>>>>     address_space_init(&address_space_memory, system_memory, "memory");
> > > > >>>>>>>>>> 
> > > > >>>>>>>>>>     system_io = g_malloc(sizeof(*system_io));
> > > > >>>>>>>>> 
> > > > >>>>>>>>> This seems to have some unexpected consequences around sizing 64bit PCI
> > > > >>>>>>>>> BARs that I'm not sure how to handle.
> > > > >>>>>>>> 
> > > > >>>>>>>> BARs are often disabled during sizing. Maybe you
> > > > >>>>>>>> don't detect BAR being disabled?
> > > > >>>>>>> 
> > > > >>>>>>> See the trace below, the BARs are not disabled.  QEMU pci-core is doing
> > > > >>>>>>> the sizing an memory region updates for the BARs, vfio is just a
> > > > >>>>>>> pass-through here.
> > > > >>>>>> 
> > > > >>>>>> Sorry, not in the trace below, but yes the sizing seems to be happening
> > > > >>>>>> while I/O & memory are enabled int he command register.  Thanks,
> > > > >>>>>> 
> > > > >>>>>> Alex
> > > > >>>>> 
> > > > >>>>> OK then from QEMU POV this BAR value is not special at all.
> > > > >>>> 
> > > > >>>> Unfortunately
> > > > >>>> 
> > > > >>>>>>>>> After this patch I get vfio
> > > > >>>>>>>>> traces like this:
> > > > >>>>>>>>> 
> > > > >>>>>>>>> vfio: vfio_pci_read_config(0000:01:10.0, @0x10, len=0x4) febe0004
> > > > >>>>>>>>> (save lower 32bits of BAR)
> > > > >>>>>>>>> vfio: vfio_pci_write_config(0000:01:10.0, @0x10, 0xffffffff, len=0x4)
> > > > >>>>>>>>> (write mask to BAR)
> > > > >>>>>>>>> vfio: region_del febe0000 - febe3fff
> > > > >>>>>>>>> (memory region gets unmapped)
> > > > >>>>>>>>> vfio: vfio_pci_read_config(0000:01:10.0, @0x10, len=0x4) ffffc004
> > > > >>>>>>>>> (read size mask)
> > > > >>>>>>>>> vfio: vfio_pci_write_config(0000:01:10.0, @0x10, 0xfebe0004, len=0x4)
> > > > >>>>>>>>> (restore BAR)
> > > > >>>>>>>>> vfio: region_add febe0000 - febe3fff [0x7fcf3654d000]
> > > > >>>>>>>>> (memory region re-mapped)
> > > > >>>>>>>>> vfio: vfio_pci_read_config(0000:01:10.0, @0x14, len=0x4) 0
> > > > >>>>>>>>> (save upper 32bits of BAR)
> > > > >>>>>>>>> vfio: vfio_pci_write_config(0000:01:10.0, @0x14, 0xffffffff, len=0x4)
> > > > >>>>>>>>> (write mask to BAR)
> > > > >>>>>>>>> vfio: region_del febe0000 - febe3fff
> > > > >>>>>>>>> (memory region gets unmapped)
> > > > >>>>>>>>> vfio: region_add fffffffffebe0000 - fffffffffebe3fff [0x7fcf3654d000]
> > > > >>>>>>>>> (memory region gets re-mapped with new address)
> > > > >>>>>>>>> qemu-system-x86_64: vfio_dma_map(0x7fcf38861710, 0xfffffffffebe0000, 0x4000, 0x7fcf3654d000) = -14 (Bad address)
> > > > >>>>>>>>> (iommu barfs because it can only handle 48bit physical addresses)
> > > > >>>>>>>>> 
> > > > >>>>>>>> 
> > > > >>>>>>>> Why are you trying to program BAR addresses for dma in the iommu?
> > > > >>>>>>> 
> > > > >>>>>>> Two reasons, first I can't tell the difference between RAM and MMIO.
> > > > >>>>> 
> > > > >>>>> Why can't you? Generally memory core let you find out easily.
> > > > >>>> 
> > > > >>>> My MemoryListener is setup for &address_space_memory and I then filter
> > > > >>>> out anything that's not memory_region_is_ram().  This still gets
> > > > >>>> through, so how do I easily find out?
> > > > >>>> 
> > > > >>>>> But in this case it's vfio device itself that is sized so for sure you
> > > > >>>>> know it's MMIO.
> > > > >>>> 
> > > > >>>> How so?  I have a MemoryListener as described above and pass everything
> > > > >>>> through to the IOMMU.  I suppose I could look through all the
> > > > >>>> VFIODevices and check if the MemoryRegion matches, but that seems really
> > > > >>>> ugly.
> > > > >>>> 
> > > > >>>>> Maybe you will have same issue if there's another device with a 64 bit
> > > > >>>>> bar though, like ivshmem?
> > > > >>>> 
> > > > >>>> Perhaps, I suspect I'll see anything that registers their BAR
> > > > >>>> MemoryRegion from memory_region_init_ram or memory_region_init_ram_ptr.
> > > > >>> 
> > > > >>> Must be a 64 bit BAR to trigger the issue though.
> > > > >>> 
> > > > >>>>>>> Second, it enables peer-to-peer DMA between devices, which is something
> > > > >>>>>>> that we might be able to take advantage of with GPU passthrough.
> > > > >>>>>>> 
> > > > >>>>>>>>> Prior to this change, there was no re-map with the fffffffffebe0000
> > > > >>>>>>>>> address, presumably because it was beyond the address space of the PCI
> > > > >>>>>>>>> window.  This address is clearly not in a PCI MMIO space, so why are we
> > > > >>>>>>>>> allowing it to be realized in the system address space at this location?
> > > > >>>>>>>>> Thanks,
> > > > >>>>>>>>> 
> > > > >>>>>>>>> Alex
> > > > >>>>>>>> 
> > > > >>>>>>>> Why do you think it is not in PCI MMIO space?
> > > > >>>>>>>> True, CPU can't access this address but other pci devices can.
> > > > >>>>>>> 
> > > > >>>>>>> What happens on real hardware when an address like this is programmed to
> > > > >>>>>>> a device?  The CPU doesn't have the physical bits to access it.  I have
> > > > >>>>>>> serious doubts that another PCI device would be able to access it
> > > > >>>>>>> either.  Maybe in some limited scenario where the devices are on the
> > > > >>>>>>> same conventional PCI bus.  In the typical case, PCI addresses are
> > > > >>>>>>> always limited by some kind of aperture, whether that's explicit in
> > > > >>>>>>> bridge windows or implicit in hardware design (and perhaps made explicit
> > > > >>>>>>> in ACPI).  Even if I wanted to filter these out as noise in vfio, how
> > > > >>>>>>> would I do it in a way that still allows real 64bit MMIO to be
> > > > >>>>>>> programmed.  PCI has this knowledge, I hope.  VFIO doesn't.  Thanks,
> > > > >>>>>>> 
> > > > >>>>>>> Alex
> > > > >>>>>> 
> > > > >>>>> 
> > > > >>>>> AFAIK PCI doesn't have that knowledge as such. PCI spec is explicit that
> > > > >>>>> full 64 bit addresses must be allowed and hardware validation
> > > > >>>>> test suites normally check that it actually does work
> > > > >>>>> if it happens.
> > > > >>>> 
> > > > >>>> Sure, PCI devices themselves, but the chipset typically has defined
> > > > >>>> routing, that's more what I'm referring to.  There are generally only
> > > > >>>> fixed address windows for RAM vs MMIO.
> > > > >>> 
> > > > >>> The physical chipset? Likely - in the presence of IOMMU.
> > > > >>> Without that, devices can talk to each other without going
> > > > >>> through chipset, and bridge spec is very explicit that
> > > > >>> full 64 bit addressing must be supported.
> > > > >>> 
> > > > >>> So as long as we don't emulate an IOMMU,
> > > > >>> guest will normally think it's okay to use any address.
> > > > >>> 
> > > > >>>>> Yes, if there's a bridge somewhere on the path that bridge's
> > > > >>>>> windows would protect you, but pci already does this filtering:
> > > > >>>>> if you see this address in the memory map this means
> > > > >>>>> your virtual device is on root bus.
> > > > >>>>> 
> > > > >>>>> So I think it's the other way around: if VFIO requires specific
> > > > >>>>> address ranges to be assigned to devices, it should give this
> > > > >>>>> info to qemu and qemu can give this to guest.
> > > > >>>>> Then anything outside that range can be ignored by VFIO.
> > > > >>>> 
> > > > >>>> Then we get into deficiencies in the IOMMU API and maybe VFIO.  There's
> > > > >>>> currently no way to find out the address width of the IOMMU.  We've been
> > > > >>>> getting by because it's safely close enough to the CPU address width to
> > > > >>>> not be a concern until we start exposing things at the top of the 64bit
> > > > >>>> address space.  Maybe I can safely ignore anything above
> > > > >>>> TARGET_PHYS_ADDR_SPACE_BITS for now.  Thanks,
> > > > >>>> 
> > > > >>>> Alex
> > > > >>> 
> > > > >>> I think it's not related to target CPU at all - it's a host limitation.
> > > > >>> So just make up your own constant, maybe depending on host architecture.
> > > > >>> Long term add an ioctl to query it.
> > > > >> 
> > > > >> It's a hardware limitation which I'd imagine has some loose ties to the
> > > > >> physical address bits of the CPU.
> > > > >> 
> > > > >>> Also, we can add a fwcfg interface to tell bios that it should avoid
> > > > >>> placing BARs above some address.
> > > > >> 
> > > > >> That doesn't help this case, it's a spurious mapping caused by sizing
> > > > >> the BARs with them enabled.  We may still want such a thing to feed into
> > > > >> building ACPI tables though.
> > > > > 
> > > > > Well the point is that if you want BIOS to avoid
> > > > > specific addresses, you need to tell it what to avoid.
> > > > > But neither BIOS nor ACPI actually cover the range above
> > > > > 2^48 ATM so it's not a high priority.
> > > > > 
> > > > >>> Since it's a vfio limitation I think it should be a vfio API, along the
> > > > >>> lines of vfio_get_addr_space_bits(void).
> > > > >>> (Is this true btw? legacy assignment doesn't have this problem?)
> > > > >> 
> > > > >> It's an IOMMU hardware limitation, legacy assignment has the same
> > > > >> problem.  It looks like legacy will abort() in QEMU for the failed
> > > > >> mapping and I'm planning to tighten vfio to also kill the VM for failed
> > > > >> mappings.  In the short term, I think I'll ignore any mappings above
> > > > >> TARGET_PHYS_ADDR_SPACE_BITS,
> > > > > 
> > > > > That seems very wrong. It will still fail on an x86 host if we are
> > > > > emulating a CPU with full 64 bit addressing. The limitation is on the
> > > > > host side there's no real reason to tie it to the target.
> > > 
> > > I doubt vfio would be the only thing broken in that case.
> > 
> > A bit cryptic.
> > target-s390x/cpu.h:#define TARGET_PHYS_ADDR_SPACE_BITS 64
> > So qemu does emulate at least one full-64 bit CPU.
> > 
> > It's possible that something limits PCI BAR address
> > there, it might or might not be architectural.
> > 
> > > > >> long term vfio already has an IOMMU info
> > > > >> ioctl that we could use to return this information, but we'll need to
> > > > >> figure out how to get it out of the IOMMU driver first.
> > > > >> Thanks,
> > > > >> 
> > > > >> Alex
> > > > > 
> > > > > Short term, just assume 48 bits on x86.
> > > 
> > > I hate to pick an arbitrary value since we have a very specific mapping
> > > we're trying to avoid.
> > 
> > Well it's not a specific mapping really.
> > 
> > Any mapping outside host IOMMU would not work.
> > guests happen to trigger it while sizing but again
> > they are allowed to write anything into BARs really.
> > 
> > >  Perhaps a better option is to skip anything
> > > where:
> > > 
> > >         MemoryRegionSection.offset_within_address_space >
> > >         ~MemoryRegionSection.offset_within_address_space
> > 
> > 
> > This merely checks that high bit is 1, doesn't it?
> > 
> > So this equivalently assumes 63 bits on x86, if you prefer
> > 63 and not 48, that's fine with me.
> > 
> > 
> > 
> > 
> > > > > We need to figure out what's the limitation on ppc and arm -
> > > > > maybe there's none and it can address full 64 bit range.
> > > > 
> > > > IIUC on PPC and ARM you always have BAR windows where things can get mapped into. Unlike x86 where the full phyiscal address range can be overlayed by BARs.
> > > > 
> > > > Or did I misunderstand the question?
> > > 
> > > Sounds right, if either BAR mappings outside the window will not be
> > > realized in the memory space or the IOMMU has a full 64bit address
> > > space, there's no problem.  Here we have an intermediate step in the BAR
> > > sizing producing a stray mapping that the IOMMU hardware can't handle.
> > > Even if we could handle it, it's not clear that we want to.  On AMD-Vi
> > > the IOMMU pages tables can grow to 6-levels deep.  A stray mapping like
> > > this then causes space and time overhead until the tables are pruned
> > > back down.  Thanks,
> > > 
> > > Alex
> > 
> > In the common case of a single VFIO device per IOMMU, you really should not
> > add its own BARs in the IOMMU. That's not a complete fix
> > but it addresses the overhead concern that you mention here.
> 
> That seems like a big assumption.  We now have support for assigned GPUs
> which can be paired to do SLI.  One way they might do SLI is via
> peer-to-peer DMA.  We can enable that by mapping device BARs through the
> IOMMU.  So it seems quite valid to want to map these.

Absolutely. But then the question is: how do we know the guest isn't
intentionally mapping these at addresses inaccessible to the guest VCPU?

> If we choose not to map them, how do we distinguish them from guest RAM?
> There's no MemoryRegion flag that I'm aware of to distinguish a ram_ptr
> that points to a chunk of guest memory from one that points to the mmap
> of a device BAR.

We could invent one, it's not hard.
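
Purely as a sketch of "inventing one": a flag set by whoever creates a
ram_ptr region that is really a device BAR, and tested by DMA-mapping
listeners.  Neither the field nor these helpers exist in QEMU at this point;
every name below is hypothetical:

    /* Hypothetical: mark ram_ptr regions that are really device BARs. */
    static inline void memory_region_set_device_ram(MemoryRegion *mr, bool v)
    {
        mr->device_ram = v;               /* hypothetical MemoryRegion field */
    }

    static inline bool memory_region_is_device_ram(MemoryRegion *mr)
    {
        return mr->device_ram;            /* hypothetical MemoryRegion field */
    }

    /* In a DMA-mapping listener, treat such regions like non-RAM:
     *     if (!memory_region_is_ram(mr) || memory_region_is_device_ram(mr)) {
     *         return;
     *     }
     */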

> I think I'd need to explicitly walk all of the vfio
> device and try to match the MemoryRegion pointer to one of my devices.
> That only solves the problem for vfio devices and not ivshmem devices or
> pci-assign devices.

Not sure what's meant by "the problem". I'm merely saying that the IOMMU
should not loop transactions from a device back to itself.

> Another idea I was trying to implement is that we can enable the mmap
> MemoryRegion lazily on first access.  That means we would ignore these
> spurious bogus mappings because they never get accessed.  Two problems
> though, first how/where to disable the mmap MemoryRegion (modifying the
> memory map from within MemoryListener.region_del seems to do bad
> things), second we can't handle the case of a BAR only being accessed
> via peer-to-peer (which seems unlikely).  Perhaps the nail in the coffin
> again is that it only solves the problem for vfio devices, spurious
> mappings from other devices backed by ram_ptr will still fault.  Thanks,
> 
> Alex

In the end, I think it's an IOMMU limitation. If you agree, we should
handle it as such: expose an API telling QEMU what the limitation is,
and we'll do our best to make sure guests don't place valid BARs
outside it, using ACPI etc.
You will then be able to assume anything outside the valid range
can be skipped.
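
What that could look like on the QEMU side, assuming the proposed (and not yet
existing) vfio_get_addr_space_bits() query eventually reports the host IOMMU's
address width:

    static void limited_region_add(MemoryListener *listener,
                                   MemoryRegionSection *section)
    {
        hwaddr iova = section->offset_within_address_space;
        hwaddr end = iova + int128_get64(section->size) - 1;
        unsigned bits = vfio_get_addr_space_bits();   /* hypothetical query */

        if (bits < 64 && (end >> bits)) {
            return;   /* beyond what the host IOMMU can map - skip silently */
        }

        /* ... proceed to map (iova .. end) through the IOMMU ... */
    }
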
Alex Williamson Jan. 14, 2014, 4:15 p.m. UTC | #31
On Tue, 2014-01-14 at 18:03 +0200, Michael S. Tsirkin wrote:
> On Tue, Jan 14, 2014 at 08:57:58AM -0700, Alex Williamson wrote:
> > On Tue, 2014-01-14 at 14:07 +0200, Michael S. Tsirkin wrote:
> > > On Mon, Jan 13, 2014 at 03:48:11PM -0700, Alex Williamson wrote:
> > > > On Mon, 2014-01-13 at 22:48 +0100, Alexander Graf wrote:
> > > > > 
> > > > > > Am 13.01.2014 um 22:39 schrieb Alex Williamson <alex.williamson@redhat.com>:
> > > > > > 
> > > > > >> On Sun, 2014-01-12 at 16:03 +0100, Alexander Graf wrote:
> > > > > >>> On 12.01.2014, at 08:54, Michael S. Tsirkin <mst@redhat.com> wrote:
> > > > > >>> 
> > > > > >>>> On Fri, Jan 10, 2014 at 08:31:36AM -0700, Alex Williamson wrote:
> > > > > >>>>> On Fri, 2014-01-10 at 14:55 +0200, Michael S. Tsirkin wrote:
> > > > > >>>>>> On Thu, Jan 09, 2014 at 03:42:22PM -0700, Alex Williamson wrote:
> > > > > >>>>>>> On Thu, 2014-01-09 at 23:56 +0200, Michael S. Tsirkin wrote:
> > > > > >>>>>>>> On Thu, Jan 09, 2014 at 12:03:26PM -0700, Alex Williamson wrote:
> > > > > >>>>>>>>> On Thu, 2014-01-09 at 11:47 -0700, Alex Williamson wrote:
> > > > > >>>>>>>>>> On Thu, 2014-01-09 at 20:00 +0200, Michael S. Tsirkin wrote:
> > > > > >>>>>>>>>>> On Thu, Jan 09, 2014 at 10:24:47AM -0700, Alex Williamson wrote:
> > > > > >>>>>>>>>>>> On Wed, 2013-12-11 at 20:30 +0200, Michael S. Tsirkin wrote:
> > > > > >>>>>>>>>>>> From: Paolo Bonzini <pbonzini@redhat.com>
> > > > > >>>>>>>>>>>> 
> > > > > >>>>>>>>>>>> As an alternative to commit 818f86b (exec: limit system memory
> > > > > >>>>>>>>>>>> size, 2013-11-04) let's just make all address spaces 64-bit wide.
> > > > > >>>>>>>>>>>> This eliminates problems with phys_page_find ignoring bits above
> > > > > >>>>>>>>>>>> TARGET_PHYS_ADDR_SPACE_BITS and address_space_translate_internal
> > > > > >>>>>>>>>>>> consequently messing up the computations.
> > > > > >>>>>>>>>>>> 
> > > > > >>>>>>>>>>>> In Luiz's reported crash, at startup gdb attempts to read from address
> > > > > >>>>>>>>>>>> 0xffffffffffffffe6 to 0xffffffffffffffff inclusive.  The region it gets
> > > > > >>>>>>>>>>>> is the newly introduced master abort region, which is as big as the PCI
> > > > > >>>>>>>>>>>> address space (see pci_bus_init).  Due to a typo that's only 2^63-1,
> > > > > >>>>>>>>>>>> not 2^64.  But we get it anyway because phys_page_find ignores the upper
> > > > > >>>>>>>>>>>> bits of the physical address.  In address_space_translate_internal then
> > > > > >>>>>>>>>>>> 
> > > > > >>>>>>>>>>>>   diff = int128_sub(section->mr->size, int128_make64(addr));
> > > > > >>>>>>>>>>>>   *plen = int128_get64(int128_min(diff, int128_make64(*plen)));
> > > > > >>>>>>>>>>>> 
> > > > > >>>>>>>>>>>> diff becomes negative, and int128_get64 booms.
> > > > > >>>>>>>>>>>> 
> > > > > >>>>>>>>>>>> The size of the PCI address space region should be fixed anyway.
> > > > > >>>>>>>>>>>> 
> > > > > >>>>>>>>>>>> Reported-by: Luiz Capitulino <lcapitulino@redhat.com>
> > > > > >>>>>>>>>>>> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> > > > > >>>>>>>>>>>> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> > > > > >>>>>>>>>>>> ---
> > > > > >>>>>>>>>>>> exec.c | 8 ++------
> > > > > >>>>>>>>>>>> 1 file changed, 2 insertions(+), 6 deletions(-)
> > > > > >>>>>>>>>>>> 
> > > > > >>>>>>>>>>>> diff --git a/exec.c b/exec.c
> > > > > >>>>>>>>>>>> index 7e5ce93..f907f5f 100644
> > > > > >>>>>>>>>>>> --- a/exec.c
> > > > > >>>>>>>>>>>> +++ b/exec.c
> > > > > >>>>>>>>>>>> @@ -94,7 +94,7 @@ struct PhysPageEntry {
> > > > > >>>>>>>>>>>> #define PHYS_MAP_NODE_NIL (((uint32_t)~0) >> 6)
> > > > > >>>>>>>>>>>> 
> > > > > >>>>>>>>>>>> /* Size of the L2 (and L3, etc) page tables.  */
> > > > > >>>>>>>>>>>> -#define ADDR_SPACE_BITS TARGET_PHYS_ADDR_SPACE_BITS
> > > > > >>>>>>>>>>>> +#define ADDR_SPACE_BITS 64
> > > > > >>>>>>>>>>>> 
> > > > > >>>>>>>>>>>> #define P_L2_BITS 10
> > > > > >>>>>>>>>>>> #define P_L2_SIZE (1 << P_L2_BITS)
> > > > > >>>>>>>>>>>> @@ -1861,11 +1861,7 @@ static void memory_map_init(void)
> > > > > >>>>>>>>>>>> {
> > > > > >>>>>>>>>>>>    system_memory = g_malloc(sizeof(*system_memory));
> > > > > >>>>>>>>>>>> 
> > > > > >>>>>>>>>>>> -    assert(ADDR_SPACE_BITS <= 64);
> > > > > >>>>>>>>>>>> -
> > > > > >>>>>>>>>>>> -    memory_region_init(system_memory, NULL, "system",
> > > > > >>>>>>>>>>>> -                       ADDR_SPACE_BITS == 64 ?
> > > > > >>>>>>>>>>>> -                       UINT64_MAX : (0x1ULL << ADDR_SPACE_BITS));
> > > > > >>>>>>>>>>>> +    memory_region_init(system_memory, NULL, "system", UINT64_MAX);
> > > > > >>>>>>>>>>>>    address_space_init(&address_space_memory, system_memory, "memory");
> > > > > >>>>>>>>>>>> 
> > > > > >>>>>>>>>>>>    system_io = g_malloc(sizeof(*system_io));
> > > > > >>>>>>>>>>> 
> > > > > >>>>>>>>>>> This seems to have some unexpected consequences around sizing 64bit PCI
> > > > > >>>>>>>>>>> BARs that I'm not sure how to handle.
> > > > > >>>>>>>>>> 
> > > > > >>>>>>>>>> BARs are often disabled during sizing. Maybe you
> > > > > >>>>>>>>>> don't detect BAR being disabled?
> > > > > >>>>>>>>> 
> > > > > >>>>>>>>> See the trace below, the BARs are not disabled.  QEMU pci-core is doing
> > > > > >>>>>>>>> the sizing and memory region updates for the BARs, vfio is just a
> > > > > >>>>>>>>> pass-through here.
> > > > > >>>>>>>> 
> > > > > >>>>>>>> Sorry, not in the trace below, but yes the sizing seems to be happening
> > > > > >>>>>>>> while I/O & memory are enabled in the command register.  Thanks,
> > > > > >>>>>>>> 
> > > > > >>>>>>>> Alex
> > > > > >>>>>>> 
> > > > > >>>>>>> OK then from QEMU POV this BAR value is not special at all.
> > > > > >>>>>> 
> > > > > >>>>>> Unfortunately
> > > > > >>>>>> 
> > > > > >>>>>>>>>>> After this patch I get vfio
> > > > > >>>>>>>>>>> traces like this:
> > > > > >>>>>>>>>>> 
> > > > > >>>>>>>>>>> vfio: vfio_pci_read_config(0000:01:10.0, @0x10, len=0x4) febe0004
> > > > > >>>>>>>>>>> (save lower 32bits of BAR)
> > > > > >>>>>>>>>>> vfio: vfio_pci_write_config(0000:01:10.0, @0x10, 0xffffffff, len=0x4)
> > > > > >>>>>>>>>>> (write mask to BAR)
> > > > > >>>>>>>>>>> vfio: region_del febe0000 - febe3fff
> > > > > >>>>>>>>>>> (memory region gets unmapped)
> > > > > >>>>>>>>>>> vfio: vfio_pci_read_config(0000:01:10.0, @0x10, len=0x4) ffffc004
> > > > > >>>>>>>>>>> (read size mask)
> > > > > >>>>>>>>>>> vfio: vfio_pci_write_config(0000:01:10.0, @0x10, 0xfebe0004, len=0x4)
> > > > > >>>>>>>>>>> (restore BAR)
> > > > > >>>>>>>>>>> vfio: region_add febe0000 - febe3fff [0x7fcf3654d000]
> > > > > >>>>>>>>>>> (memory region re-mapped)
> > > > > >>>>>>>>>>> vfio: vfio_pci_read_config(0000:01:10.0, @0x14, len=0x4) 0
> > > > > >>>>>>>>>>> (save upper 32bits of BAR)
> > > > > >>>>>>>>>>> vfio: vfio_pci_write_config(0000:01:10.0, @0x14, 0xffffffff, len=0x4)
> > > > > >>>>>>>>>>> (write mask to BAR)
> > > > > >>>>>>>>>>> vfio: region_del febe0000 - febe3fff
> > > > > >>>>>>>>>>> (memory region gets unmapped)
> > > > > >>>>>>>>>>> vfio: region_add fffffffffebe0000 - fffffffffebe3fff [0x7fcf3654d000]
> > > > > >>>>>>>>>>> (memory region gets re-mapped with new address)
> > > > > >>>>>>>>>>> qemu-system-x86_64: vfio_dma_map(0x7fcf38861710, 0xfffffffffebe0000, 0x4000, 0x7fcf3654d000) = -14 (Bad address)
> > > > > >>>>>>>>>>> (iommu barfs because it can only handle 48bit physical addresses)
> > > > > >>>>>>>>>> 
> > > > > >>>>>>>>>> Why are you trying to program BAR addresses for dma in the iommu?
> > > > > >>>>>>>>> 
> > > > > >>>>>>>>> Two reasons, first I can't tell the difference between RAM and MMIO.
> > > > > >>>>>>> 
> > > > > >>>>>>> Why can't you? Generally memory core let you find out easily.
> > > > > >>>>>> 
> > > > > >>>>>> My MemoryListener is setup for &address_space_memory and I then filter
> > > > > >>>>>> out anything that's not memory_region_is_ram().  This still gets
> > > > > >>>>>> through, so how do I easily find out?
> > > > > >>>>>> 
> > > > > >>>>>>> But in this case it's vfio device itself that is sized so for sure you
> > > > > >>>>>>> know it's MMIO.
> > > > > >>>>>> 
> > > > > >>>>>> How so?  I have a MemoryListener as described above and pass everything
> > > > > >>>>>> through to the IOMMU.  I suppose I could look through all the
> > > > > >>>>>> VFIODevices and check if the MemoryRegion matches, but that seems really
> > > > > >>>>>> ugly.
> > > > > >>>>>> 
> > > > > >>>>>>> Maybe you will have same issue if there's another device with a 64 bit
> > > > > >>>>>>> bar though, like ivshmem?
> > > > > >>>>>> 
> > > > > >>>>>> Perhaps, I suspect I'll see anything that registers their BAR
> > > > > >>>>>> MemoryRegion from memory_region_init_ram or memory_region_init_ram_ptr.
> > > > > >>>>> 
> > > > > >>>>> Must be a 64 bit BAR to trigger the issue though.
> > > > > >>>>> 
> > > > > >>>>>>>>> Second, it enables peer-to-peer DMA between devices, which is something
> > > > > >>>>>>>>> that we might be able to take advantage of with GPU passthrough.
> > > > > >>>>>>>>> 
> > > > > >>>>>>>>>>> Prior to this change, there was no re-map with the fffffffffebe0000
> > > > > >>>>>>>>>>> address, presumably because it was beyond the address space of the PCI
> > > > > >>>>>>>>>>> window.  This address is clearly not in a PCI MMIO space, so why are we
> > > > > >>>>>>>>>>> allowing it to be realized in the system address space at this location?
> > > > > >>>>>>>>>>> Thanks,
> > > > > >>>>>>>>>>> 
> > > > > >>>>>>>>>>> Alex
> > > > > >>>>>>>>>> 
> > > > > >>>>>>>>>> Why do you think it is not in PCI MMIO space?
> > > > > >>>>>>>>>> True, CPU can't access this address but other pci devices can.
> > > > > >>>>>>>>> 
> > > > > >>>>>>>>> What happens on real hardware when an address like this is programmed to
> > > > > >>>>>>>>> a device?  The CPU doesn't have the physical bits to access it.  I have
> > > > > >>>>>>>>> serious doubts that another PCI device would be able to access it
> > > > > >>>>>>>>> either.  Maybe in some limited scenario where the devices are on the
> > > > > >>>>>>>>> same conventional PCI bus.  In the typical case, PCI addresses are
> > > > > >>>>>>>>> always limited by some kind of aperture, whether that's explicit in
> > > > > >>>>>>>>> bridge windows or implicit in hardware design (and perhaps made explicit
> > > > > >>>>>>>>> in ACPI).  Even if I wanted to filter these out as noise in vfio, how
> > > > > >>>>>>>>> would I do it in a way that still allows real 64bit MMIO to be
> > > > > >>>>>>>>> programmed.  PCI has this knowledge, I hope.  VFIO doesn't.  Thanks,
> > > > > >>>>>>>>> 
> > > > > >>>>>>>>> Alex
> > > > > >>>>>>> 
> > > > > >>>>>>> AFAIK PCI doesn't have that knowledge as such. PCI spec is explicit that
> > > > > >>>>>>> full 64 bit addresses must be allowed and hardware validation
> > > > > >>>>>>> test suites normally check that it actually does work
> > > > > >>>>>>> if it happens.
> > > > > >>>>>> 
> > > > > >>>>>> Sure, PCI devices themselves, but the chipset typically has defined
> > > > > >>>>>> routing, that's more what I'm referring to.  There are generally only
> > > > > >>>>>> fixed address windows for RAM vs MMIO.
> > > > > >>>>> 
> > > > > >>>>> The physical chipset? Likely - in the presence of IOMMU.
> > > > > >>>>> Without that, devices can talk to each other without going
> > > > > >>>>> through chipset, and bridge spec is very explicit that
> > > > > >>>>> full 64 bit addressing must be supported.
> > > > > >>>>> 
> > > > > >>>>> So as long as we don't emulate an IOMMU,
> > > > > >>>>> guest will normally think it's okay to use any address.
> > > > > >>>>> 
> > > > > >>>>>>> Yes, if there's a bridge somewhere on the path that bridge's
> > > > > >>>>>>> windows would protect you, but pci already does this filtering:
> > > > > >>>>>>> if you see this address in the memory map this means
> > > > > >>>>>>> your virtual device is on root bus.
> > > > > >>>>>>> 
> > > > > >>>>>>> So I think it's the other way around: if VFIO requires specific
> > > > > >>>>>>> address ranges to be assigned to devices, it should give this
> > > > > >>>>>>> info to qemu and qemu can give this to guest.
> > > > > >>>>>>> Then anything outside that range can be ignored by VFIO.
> > > > > >>>>>> 
> > > > > >>>>>> Then we get into deficiencies in the IOMMU API and maybe VFIO.  There's
> > > > > >>>>>> currently no way to find out the address width of the IOMMU.  We've been
> > > > > >>>>>> getting by because it's safely close enough to the CPU address width to
> > > > > >>>>>> not be a concern until we start exposing things at the top of the 64bit
> > > > > >>>>>> address space.  Maybe I can safely ignore anything above
> > > > > >>>>>> TARGET_PHYS_ADDR_SPACE_BITS for now.  Thanks,
> > > > > >>>>>> 
> > > > > >>>>>> Alex
> > > > > >>>>> 
> > > > > >>>>> I think it's not related to target CPU at all - it's a host limitation.
> > > > > >>>>> So just make up your own constant, maybe depending on host architecture.
> > > > > >>>>> Long term add an ioctl to query it.
> > > > > >>>> 
> > > > > >>>> It's a hardware limitation which I'd imagine has some loose ties to the
> > > > > >>>> physical address bits of the CPU.
> > > > > >>>> 
> > > > > >>>>> Also, we can add a fwcfg interface to tell bios that it should avoid
> > > > > >>>>> placing BARs above some address.
> > > > > >>>> 
> > > > > >>>> That doesn't help this case, it's a spurious mapping caused by sizing
> > > > > >>>> the BARs with them enabled.  We may still want such a thing to feed into
> > > > > >>>> building ACPI tables though.
> > > > > >>> 
> > > > > >>> Well the point is that if you want BIOS to avoid
> > > > > >>> specific addresses, you need to tell it what to avoid.
> > > > > >>> But neither BIOS nor ACPI actually cover the range above
> > > > > >>> 2^48 ATM so it's not a high priority.
> > > > > >>> 
> > > > > >>>>> Since it's a vfio limitation I think it should be a vfio API, along the
> > > > > >>>>> lines of vfio_get_addr_space_bits(void).
> > > > > >>>>> (Is this true btw? legacy assignment doesn't have this problem?)
> > > > > >>>> 
> > > > > >>>> It's an IOMMU hardware limitation, legacy assignment has the same
> > > > > >>>> problem.  It looks like legacy will abort() in QEMU for the failed
> > > > > >>>> mapping and I'm planning to tighten vfio to also kill the VM for failed
> > > > > >>>> mappings.  In the short term, I think I'll ignore any mappings above
> > > > > >>>> TARGET_PHYS_ADDR_SPACE_BITS,
> > > > > >>> 
> > > > > >>> That seems very wrong. It will still fail on an x86 host if we are
> > > > > >>> emulating a CPU with full 64 bit addressing. The limitation is on the
> > > > > >>> host side; there's no real reason to tie it to the target.
> > > > > > 
> > > > > > I doubt vfio would be the only thing broken in that case.
> > > > > > 
> > > > > >>>> long term vfio already has an IOMMU info
> > > > > >>>> ioctl that we could use to return this information, but we'll need to
> > > > > >>>> figure out how to get it out of the IOMMU driver first.
> > > > > >>>> Thanks,
> > > > > >>>> 
> > > > > >>>> Alex
> > > > > >>> 
> > > > > >>> Short term, just assume 48 bits on x86.
> > > > > > 
> > > > > > I hate to pick an arbitrary value since we have a very specific mapping
> > > > > > we're trying to avoid.  Perhaps a better option is to skip anything
> > > > > > where:
> > > > > > 
> > > > > >        MemoryRegionSection.offset_within_address_space >
> > > > > >        ~MemoryRegionSection.offset_within_address_space
> > > > > > 
> > > > > >>> We need to figure out what's the limitation on ppc and arm -
> > > > > >>> maybe there's none and it can address full 64 bit range.
> > > > > >> 
> > > > > >> IIUC on PPC and ARM you always have BAR windows where things can get mapped into. Unlike x86 where the full physical address range can be overlaid by BARs.
> > > > > >> 
> > > > > >> Or did I misunderstand the question?
> > > > > > 
> > > > > > Sounds right, if either BAR mappings outside the window will not be
> > > > > > realized in the memory space or the IOMMU has a full 64bit address
> > > > > > space, there's no problem.  Here we have an intermediate step in the BAR
> > > > > > sizing producing a stray mapping that the IOMMU hardware can't handle.
> > > > > > Even if we could handle it, it's not clear that we want to.  On AMD-Vi
> > > > > > the IOMMU page tables can grow to 6 levels deep.  A stray mapping like
> > > > > > this then causes space and time overhead until the tables are pruned
> > > > > > back down.  Thanks,
> > > > > 
> > > > > I thought sizing is hard defined as a set to
> > > > > -1? Can't we check for that one special case and treat it as "not mapped, but tell the guest the size in config space"?
> > > > 
> > > > PCI doesn't want to handle this as anything special to differentiate a
> > > > sizing mask from a valid BAR address.  I agree though, I'd prefer to
> > > > never see a spurious address like this in my MemoryListener.
> > > 
> > > It's more a can't than doesn't want to: it's a 64 bit BAR, it's not
> > > set to all ones atomically.
> > > 
> > > Also, while it doesn't address this fully (same issue can happen
> > > e.g. with ivshmem), do you think we should distinguish these BARs mapped
> > > from vfio / device assignment in qemu somehow?
> > > 
> > > In particular, even when it has sane addresses:
> > > device really can not DMA into its own BAR, that's a spec violation
> > > so in theory can do anything including crashing the system.
> > > I don't know what happens in practice but
> > > if you are programming IOMMU to forward transactions back to
> > > device that originated it, you are not doing it any favors.
> > 
> > I might concede that peer-to-peer is more trouble than it's worth if I
> > had a convenient way to ignore MMIO mappings in my MemoryListener, but I
> > don't.
> 
> Well for VFIO devices you are creating these mappings so we surely
> can find a way for you to check that.
> Doesn't each segment point back at the memory region that created it?
> Then you can just check that.

It's a fairly heavy-weight search and it only avoids vfio devices, so it
feels like it's just delaying a real solution.
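
For reference, a rough stand-alone model of what that search ends up looking
like; the structures and field names below are invented for illustration and
are not the real VFIODevice layout:

    #include <stdbool.h>
    #include <stdio.h>

    /* Stand-ins for the real QEMU types; only pointer identity matters here. */
    typedef struct MemoryRegion { const char *name; } MemoryRegion;

    typedef struct VFIOBAR {
        MemoryRegion *mem;              /* region backing this BAR, if any */
    } VFIOBAR;

    typedef struct VFIODevice {
        VFIOBAR bars[6];                /* a PCI device has up to 6 BARs */
        struct VFIODevice *next;
    } VFIODevice;

    static VFIODevice *vfio_device_list;    /* every assigned device */

    /* The heavy-weight part: on every region_add/region_del callback, walk
     * all assigned devices and all of their BARs to see whether the
     * section's MemoryRegion was created by vfio itself. */
    static bool region_is_vfio_bar(const MemoryRegion *mr)
    {
        const VFIODevice *vdev;
        int i;

        for (vdev = vfio_device_list; vdev; vdev = vdev->next) {
            for (i = 0; i < 6; i++) {
                if (mr && vdev->bars[i].mem == mr) {
                    return true;
                }
            }
        }
        return false;
    }

    int main(void)
    {
        static MemoryRegion bar0 = { "vfio-bar0" }, ram = { "pc.ram" };
        static VFIODevice vdev;

        vdev.bars[0].mem = &bar0;
        vfio_device_list = &vdev;

        /* prints "1 0": the BAR would be skipped, guest RAM still mapped */
        printf("%d %d\n", region_is_vfio_bar(&bar0), region_is_vfio_bar(&ram));
        return 0;
    }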

> >  Self-DMA is really not the intent of doing the mapping, but
> > peer-to-peer does have merit.
> > 
> > > I also note that if someone tries zero copy transmit out of such an
> > > address, get user pages will fail.
> > > I think this means tun zero copy transmit needs to fall-back
> > > on copy from user on get user pages failure.
> > > 
> > > Jason, what's your thinking on this?
> > > 
> > 
> >
Michael S. Tsirkin Jan. 14, 2014, 4:18 p.m. UTC | #32
On Tue, Jan 14, 2014 at 09:15:14AM -0700, Alex Williamson wrote:
> On Tue, 2014-01-14 at 18:03 +0200, Michael S. Tsirkin wrote:
> > On Tue, Jan 14, 2014 at 08:57:58AM -0700, Alex Williamson wrote:
> > > On Tue, 2014-01-14 at 14:07 +0200, Michael S. Tsirkin wrote:
> > > > On Mon, Jan 13, 2014 at 03:48:11PM -0700, Alex Williamson wrote:
> > > > > On Mon, 2014-01-13 at 22:48 +0100, Alexander Graf wrote:
> > > > > > 
> > > > > > > Am 13.01.2014 um 22:39 schrieb Alex Williamson <alex.williamson@redhat.com>:
> > > > > > > 
> > > > > > >> On Sun, 2014-01-12 at 16:03 +0100, Alexander Graf wrote:
> > > > > > >>> On 12.01.2014, at 08:54, Michael S. Tsirkin <mst@redhat.com> wrote:
> > > > > > >>> 
> > > > > > >>>> On Fri, Jan 10, 2014 at 08:31:36AM -0700, Alex Williamson wrote:
> > > > > > >>>>> On Fri, 2014-01-10 at 14:55 +0200, Michael S. Tsirkin wrote:
> > > > > > >>>>>> On Thu, Jan 09, 2014 at 03:42:22PM -0700, Alex Williamson wrote:
> > > > > > >>>>>>> On Thu, 2014-01-09 at 23:56 +0200, Michael S. Tsirkin wrote:
> > > > > > >>>>>>>> On Thu, Jan 09, 2014 at 12:03:26PM -0700, Alex Williamson wrote:
> > > > > > >>>>>>>>> On Thu, 2014-01-09 at 11:47 -0700, Alex Williamson wrote:
> > > > > > >>>>>>>>>> On Thu, 2014-01-09 at 20:00 +0200, Michael S. Tsirkin wrote:
> > > > > > >>>>>>>>>>> On Thu, Jan 09, 2014 at 10:24:47AM -0700, Alex Williamson wrote:
> > > > > > >>>>>>>>>>>> On Wed, 2013-12-11 at 20:30 +0200, Michael S. Tsirkin wrote:
> > > > > > >>>>>>>>>>>> From: Paolo Bonzini <pbonzini@redhat.com>
> > > > > > >>>>>>>>>>>> 
> > > > > > >>>>>>>>>>>> As an alternative to commit 818f86b (exec: limit system memory
> > > > > > >>>>>>>>>>>> size, 2013-11-04) let's just make all address spaces 64-bit wide.
> > > > > > >>>>>>>>>>>> This eliminates problems with phys_page_find ignoring bits above
> > > > > > >>>>>>>>>>>> TARGET_PHYS_ADDR_SPACE_BITS and address_space_translate_internal
> > > > > > >>>>>>>>>>>> consequently messing up the computations.
> > > > > > >>>>>>>>>>>> 
> > > > > > >>>>>>>>>>>> In Luiz's reported crash, at startup gdb attempts to read from address
> > > > > > >>>>>>>>>>>> 0xffffffffffffffe6 to 0xffffffffffffffff inclusive.  The region it gets
> > > > > > >>>>>>>>>>>> is the newly introduced master abort region, which is as big as the PCI
> > > > > > >>>>>>>>>>>> address space (see pci_bus_init).  Due to a typo that's only 2^63-1,
> > > > > > >>>>>>>>>>>> not 2^64.  But we get it anyway because phys_page_find ignores the upper
> > > > > > >>>>>>>>>>>> bits of the physical address.  In address_space_translate_internal then
> > > > > > >>>>>>>>>>>> 
> > > > > > >>>>>>>>>>>>   diff = int128_sub(section->mr->size, int128_make64(addr));
> > > > > > >>>>>>>>>>>>   *plen = int128_get64(int128_min(diff, int128_make64(*plen)));
> > > > > > >>>>>>>>>>>> 
> > > > > > >>>>>>>>>>>> diff becomes negative, and int128_get64 booms.
> > > > > > >>>>>>>>>>>> 
> > > > > > >>>>>>>>>>>> The size of the PCI address space region should be fixed anyway.
> > > > > > >>>>>>>>>>>> 
> > > > > > >>>>>>>>>>>> Reported-by: Luiz Capitulino <lcapitulino@redhat.com>
> > > > > > >>>>>>>>>>>> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> > > > > > >>>>>>>>>>>> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> > > > > > >>>>>>>>>>>> ---
> > > > > > >>>>>>>>>>>> exec.c | 8 ++------
> > > > > > >>>>>>>>>>>> 1 file changed, 2 insertions(+), 6 deletions(-)
> > > > > > >>>>>>>>>>>> 
> > > > > > >>>>>>>>>>>> diff --git a/exec.c b/exec.c
> > > > > > >>>>>>>>>>>> index 7e5ce93..f907f5f 100644
> > > > > > >>>>>>>>>>>> --- a/exec.c
> > > > > > >>>>>>>>>>>> +++ b/exec.c
> > > > > > >>>>>>>>>>>> @@ -94,7 +94,7 @@ struct PhysPageEntry {
> > > > > > >>>>>>>>>>>> #define PHYS_MAP_NODE_NIL (((uint32_t)~0) >> 6)
> > > > > > >>>>>>>>>>>> 
> > > > > > >>>>>>>>>>>> /* Size of the L2 (and L3, etc) page tables.  */
> > > > > > >>>>>>>>>>>> -#define ADDR_SPACE_BITS TARGET_PHYS_ADDR_SPACE_BITS
> > > > > > >>>>>>>>>>>> +#define ADDR_SPACE_BITS 64
> > > > > > >>>>>>>>>>>> 
> > > > > > >>>>>>>>>>>> #define P_L2_BITS 10
> > > > > > >>>>>>>>>>>> #define P_L2_SIZE (1 << P_L2_BITS)
> > > > > > >>>>>>>>>>>> @@ -1861,11 +1861,7 @@ static void memory_map_init(void)
> > > > > > >>>>>>>>>>>> {
> > > > > > >>>>>>>>>>>>    system_memory = g_malloc(sizeof(*system_memory));
> > > > > > >>>>>>>>>>>> 
> > > > > > >>>>>>>>>>>> -    assert(ADDR_SPACE_BITS <= 64);
> > > > > > >>>>>>>>>>>> -
> > > > > > >>>>>>>>>>>> -    memory_region_init(system_memory, NULL, "system",
> > > > > > >>>>>>>>>>>> -                       ADDR_SPACE_BITS == 64 ?
> > > > > > >>>>>>>>>>>> -                       UINT64_MAX : (0x1ULL << ADDR_SPACE_BITS));
> > > > > > >>>>>>>>>>>> +    memory_region_init(system_memory, NULL, "system", UINT64_MAX);
> > > > > > >>>>>>>>>>>>    address_space_init(&address_space_memory, system_memory, "memory");
> > > > > > >>>>>>>>>>>> 
> > > > > > >>>>>>>>>>>>    system_io = g_malloc(sizeof(*system_io));
> > > > > > >>>>>>>>>>> 
> > > > > > >>>>>>>>>>> This seems to have some unexpected consequences around sizing 64bit PCI
> > > > > > >>>>>>>>>>> BARs that I'm not sure how to handle.
> > > > > > >>>>>>>>>> 
> > > > > > >>>>>>>>>> BARs are often disabled during sizing. Maybe you
> > > > > > >>>>>>>>>> don't detect BAR being disabled?
> > > > > > >>>>>>>>> 
> > > > > > >>>>>>>>> See the trace below, the BARs are not disabled.  QEMU pci-core is doing
> > > > > > >>>>>>>>> the sizing and memory region updates for the BARs, vfio is just a
> > > > > > >>>>>>>>> pass-through here.
> > > > > > >>>>>>>> 
> > > > > > >>>>>>>> Sorry, not in the trace below, but yes the sizing seems to be happening
> > > > > > >>>>>>>> while I/O & memory are enabled in the command register.  Thanks,
> > > > > > >>>>>>>> 
> > > > > > >>>>>>>> Alex
> > > > > > >>>>>>> 
> > > > > > >>>>>>> OK then from QEMU POV this BAR value is not special at all.
> > > > > > >>>>>> 
> > > > > > >>>>>> Unfortunately
> > > > > > >>>>>> 
> > > > > > >>>>>>>>>>> After this patch I get vfio
> > > > > > >>>>>>>>>>> traces like this:
> > > > > > >>>>>>>>>>> 
> > > > > > >>>>>>>>>>> vfio: vfio_pci_read_config(0000:01:10.0, @0x10, len=0x4) febe0004
> > > > > > >>>>>>>>>>> (save lower 32bits of BAR)
> > > > > > >>>>>>>>>>> vfio: vfio_pci_write_config(0000:01:10.0, @0x10, 0xffffffff, len=0x4)
> > > > > > >>>>>>>>>>> (write mask to BAR)
> > > > > > >>>>>>>>>>> vfio: region_del febe0000 - febe3fff
> > > > > > >>>>>>>>>>> (memory region gets unmapped)
> > > > > > >>>>>>>>>>> vfio: vfio_pci_read_config(0000:01:10.0, @0x10, len=0x4) ffffc004
> > > > > > >>>>>>>>>>> (read size mask)
> > > > > > >>>>>>>>>>> vfio: vfio_pci_write_config(0000:01:10.0, @0x10, 0xfebe0004, len=0x4)
> > > > > > >>>>>>>>>>> (restore BAR)
> > > > > > >>>>>>>>>>> vfio: region_add febe0000 - febe3fff [0x7fcf3654d000]
> > > > > > >>>>>>>>>>> (memory region re-mapped)
> > > > > > >>>>>>>>>>> vfio: vfio_pci_read_config(0000:01:10.0, @0x14, len=0x4) 0
> > > > > > >>>>>>>>>>> (save upper 32bits of BAR)
> > > > > > >>>>>>>>>>> vfio: vfio_pci_write_config(0000:01:10.0, @0x14, 0xffffffff, len=0x4)
> > > > > > >>>>>>>>>>> (write mask to BAR)
> > > > > > >>>>>>>>>>> vfio: region_del febe0000 - febe3fff
> > > > > > >>>>>>>>>>> (memory region gets unmapped)
> > > > > > >>>>>>>>>>> vfio: region_add fffffffffebe0000 - fffffffffebe3fff [0x7fcf3654d000]
> > > > > > >>>>>>>>>>> (memory region gets re-mapped with new address)
> > > > > > >>>>>>>>>>> qemu-system-x86_64: vfio_dma_map(0x7fcf38861710, 0xfffffffffebe0000, 0x4000, 0x7fcf3654d000) = -14 (Bad address)
> > > > > > >>>>>>>>>>> (iommu barfs because it can only handle 48bit physical addresses)
> > > > > > >>>>>>>>>> 
> > > > > > >>>>>>>>>> Why are you trying to program BAR addresses for dma in the iommu?
> > > > > > >>>>>>>>> 
> > > > > > >>>>>>>>> Two reasons, first I can't tell the difference between RAM and MMIO.
> > > > > > >>>>>>> 
> > > > > > >>>>>>> Why can't you? Generally memory core let you find out easily.
> > > > > > >>>>>> 
> > > > > > >>>>>> My MemoryListener is setup for &address_space_memory and I then filter
> > > > > > >>>>>> out anything that's not memory_region_is_ram().  This still gets
> > > > > > >>>>>> through, so how do I easily find out?
> > > > > > >>>>>> 
> > > > > > >>>>>>> But in this case it's vfio device itself that is sized so for sure you
> > > > > > >>>>>>> know it's MMIO.
> > > > > > >>>>>> 
> > > > > > >>>>>> How so?  I have a MemoryListener as described above and pass everything
> > > > > > >>>>>> through to the IOMMU.  I suppose I could look through all the
> > > > > > >>>>>> VFIODevices and check if the MemoryRegion matches, but that seems really
> > > > > > >>>>>> ugly.
> > > > > > >>>>>> 
> > > > > > >>>>>>> Maybe you will have same issue if there's another device with a 64 bit
> > > > > > >>>>>>> bar though, like ivshmem?
> > > > > > >>>>>> 
> > > > > > >>>>>> Perhaps, I suspect I'll see anything that registers their BAR
> > > > > > >>>>>> MemoryRegion from memory_region_init_ram or memory_region_init_ram_ptr.
> > > > > > >>>>> 
> > > > > > >>>>> Must be a 64 bit BAR to trigger the issue though.
> > > > > > >>>>> 
> > > > > > >>>>>>>>> Second, it enables peer-to-peer DMA between devices, which is something
> > > > > > >>>>>>>>> that we might be able to take advantage of with GPU passthrough.
> > > > > > >>>>>>>>> 
> > > > > > >>>>>>>>>>> Prior to this change, there was no re-map with the fffffffffebe0000
> > > > > > >>>>>>>>>>> address, presumably because it was beyond the address space of the PCI
> > > > > > >>>>>>>>>>> window.  This address is clearly not in a PCI MMIO space, so why are we
> > > > > > >>>>>>>>>>> allowing it to be realized in the system address space at this location?
> > > > > > >>>>>>>>>>> Thanks,
> > > > > > >>>>>>>>>>> 
> > > > > > >>>>>>>>>>> Alex
> > > > > > >>>>>>>>>> 
> > > > > > >>>>>>>>>> Why do you think it is not in PCI MMIO space?
> > > > > > >>>>>>>>>> True, CPU can't access this address but other pci devices can.
> > > > > > >>>>>>>>> 
> > > > > > >>>>>>>>> What happens on real hardware when an address like this is programmed to
> > > > > > >>>>>>>>> a device?  The CPU doesn't have the physical bits to access it.  I have
> > > > > > >>>>>>>>> serious doubts that another PCI device would be able to access it
> > > > > > >>>>>>>>> either.  Maybe in some limited scenario where the devices are on the
> > > > > > >>>>>>>>> same conventional PCI bus.  In the typical case, PCI addresses are
> > > > > > >>>>>>>>> always limited by some kind of aperture, whether that's explicit in
> > > > > > >>>>>>>>> bridge windows or implicit in hardware design (and perhaps made explicit
> > > > > > >>>>>>>>> in ACPI).  Even if I wanted to filter these out as noise in vfio, how
> > > > > > >>>>>>>>> would I do it in a way that still allows real 64bit MMIO to be
> > > > > > >>>>>>>>> programmed.  PCI has this knowledge, I hope.  VFIO doesn't.  Thanks,
> > > > > > >>>>>>>>> 
> > > > > > >>>>>>>>> Alex
> > > > > > >>>>>>> 
> > > > > > >>>>>>> AFAIK PCI doesn't have that knowledge as such. PCI spec is explicit that
> > > > > > >>>>>>> full 64 bit addresses must be allowed and hardware validation
> > > > > > >>>>>>> test suites normally check that it actually does work
> > > > > > >>>>>>> if it happens.
> > > > > > >>>>>> 
> > > > > > >>>>>> Sure, PCI devices themselves, but the chipset typically has defined
> > > > > > >>>>>> routing, that's more what I'm referring to.  There are generally only
> > > > > > >>>>>> fixed address windows for RAM vs MMIO.
> > > > > > >>>>> 
> > > > > > >>>>> The physical chipset? Likely - in the presence of IOMMU.
> > > > > > >>>>> Without that, devices can talk to each other without going
> > > > > > >>>>> through chipset, and bridge spec is very explicit that
> > > > > > >>>>> full 64 bit addressing must be supported.
> > > > > > >>>>> 
> > > > > > >>>>> So as long as we don't emulate an IOMMU,
> > > > > > >>>>> guest will normally think it's okay to use any address.
> > > > > > >>>>> 
> > > > > > >>>>>>> Yes, if there's a bridge somewhere on the path that bridge's
> > > > > > >>>>>>> windows would protect you, but pci already does this filtering:
> > > > > > >>>>>>> if you see this address in the memory map this means
> > > > > > >>>>>>> your virtual device is on root bus.
> > > > > > >>>>>>> 
> > > > > > >>>>>>> So I think it's the other way around: if VFIO requires specific
> > > > > > >>>>>>> address ranges to be assigned to devices, it should give this
> > > > > > >>>>>>> info to qemu and qemu can give this to guest.
> > > > > > >>>>>>> Then anything outside that range can be ignored by VFIO.
> > > > > > >>>>>> 
> > > > > > >>>>>> Then we get into deficiencies in the IOMMU API and maybe VFIO.  There's
> > > > > > >>>>>> currently no way to find out the address width of the IOMMU.  We've been
> > > > > > >>>>>> getting by because it's safely close enough to the CPU address width to
> > > > > > >>>>>> not be a concern until we start exposing things at the top of the 64bit
> > > > > > >>>>>> address space.  Maybe I can safely ignore anything above
> > > > > > >>>>>> TARGET_PHYS_ADDR_SPACE_BITS for now.  Thanks,
> > > > > > >>>>>> 
> > > > > > >>>>>> Alex
> > > > > > >>>>> 
> > > > > > >>>>> I think it's not related to target CPU at all - it's a host limitation.
> > > > > > >>>>> So just make up your own constant, maybe depending on host architecture.
> > > > > > >>>>> Long term add an ioctl to query it.
> > > > > > >>>> 
> > > > > > >>>> It's a hardware limitation which I'd imagine has some loose ties to the
> > > > > > >>>> physical address bits of the CPU.
> > > > > > >>>> 
> > > > > > >>>>> Also, we can add a fwcfg interface to tell bios that it should avoid
> > > > > > >>>>> placing BARs above some address.
> > > > > > >>>> 
> > > > > > >>>> That doesn't help this case, it's a spurious mapping caused by sizing
> > > > > > >>>> the BARs with them enabled.  We may still want such a thing to feed into
> > > > > > >>>> building ACPI tables though.
> > > > > > >>> 
> > > > > > >>> Well the point is that if you want BIOS to avoid
> > > > > > >>> specific addresses, you need to tell it what to avoid.
> > > > > > >>> But neither BIOS nor ACPI actually cover the range above
> > > > > > >>> 2^48 ATM so it's not a high priority.
> > > > > > >>> 
> > > > > > >>>>> Since it's a vfio limitation I think it should be a vfio API, along the
> > > > > > >>>>> lines of vfio_get_addr_space_bits(void).
> > > > > > >>>>> (Is this true btw? legacy assignment doesn't have this problem?)
> > > > > > >>>> 
> > > > > > >>>> It's an IOMMU hardware limitation, legacy assignment has the same
> > > > > > >>>> problem.  It looks like legacy will abort() in QEMU for the failed
> > > > > > >>>> mapping and I'm planning to tighten vfio to also kill the VM for failed
> > > > > > >>>> mappings.  In the short term, I think I'll ignore any mappings above
> > > > > > >>>> TARGET_PHYS_ADDR_SPACE_BITS,
> > > > > > >>> 
> > > > > > >>> That seems very wrong. It will still fail on an x86 host if we are
> > > > > > >>> emulating a CPU with full 64 bit addressing. The limitation is on the
> > > > > > >>> host side; there's no real reason to tie it to the target.
> > > > > > > 
> > > > > > > I doubt vfio would be the only thing broken in that case.
> > > > > > > 
> > > > > > >>>> long term vfio already has an IOMMU info
> > > > > > >>>> ioctl that we could use to return this information, but we'll need to
> > > > > > >>>> figure out how to get it out of the IOMMU driver first.
> > > > > > >>>> Thanks,
> > > > > > >>>> 
> > > > > > >>>> Alex
> > > > > > >>> 
> > > > > > >>> Short term, just assume 48 bits on x86.
> > > > > > > 
> > > > > > > I hate to pick an arbitrary value since we have a very specific mapping
> > > > > > > we're trying to avoid.  Perhaps a better option is to skip anything
> > > > > > > where:
> > > > > > > 
> > > > > > >        MemoryRegionSection.offset_within_address_space >
> > > > > > >        ~MemoryRegionSection.offset_within_address_space
> > > > > > > 
> > > > > > >>> We need to figure out what's the limitation on ppc and arm -
> > > > > > >>> maybe there's none and it can address full 64 bit range.
> > > > > > >> 
> > > > > > >> IIUC on PPC and ARM you always have BAR windows where things can get mapped into. Unlike x86 where the full physical address range can be overlaid by BARs.
> > > > > > >> 
> > > > > > >> Or did I misunderstand the question?
> > > > > > > 
> > > > > > > Sounds right, if either BAR mappings outside the window will not be
> > > > > > > realized in the memory space or the IOMMU has a full 64bit address
> > > > > > > space, there's no problem.  Here we have an intermediate step in the BAR
> > > > > > > sizing producing a stray mapping that the IOMMU hardware can't handle.
> > > > > > > Even if we could handle it, it's not clear that we want to.  On AMD-Vi
> > > > > > > the IOMMU page tables can grow to 6 levels deep.  A stray mapping like
> > > > > > > this then causes space and time overhead until the tables are pruned
> > > > > > > back down.  Thanks,
> > > > > > 
> > > > > > I thought sizing is hard defined as a set to
> > > > > > -1? Can't we check for that one special case and treat it as "not mapped, but tell the guest the size in config space"?
> > > > > 
> > > > > PCI doesn't want to handle this as anything special to differentiate a
> > > > > sizing mask from a valid BAR address.  I agree though, I'd prefer to
> > > > > never see a spurious address like this in my MemoryListener.
> > > > 
> > > > It's more a can't than doesn't want to: it's a 64 bit BAR, it's not
> > > > set to all ones atomically.
> > > > 
> > > > Also, while it doesn't address this fully (same issue can happen
> > > > e.g. with ivshmem), do you think we should distinguish these BARs mapped
> > > > from vfio / device assignment in qemu somehow?
> > > > 
> > > > In particular, even when it has sane addresses:
> > > > device really can not DMA into its own BAR, that's a spec violation
> > > > so in theory can do anything including crashing the system.
> > > > I don't know what happens in practice but
> > > > if you are programming IOMMU to forward transactions back to
> > > > device that originated it, you are not doing it any favors.
> > > 
> > > I might concede that peer-to-peer is more trouble than it's worth if I
> > > had a convenient way to ignore MMIO mappings in my MemoryListener, but I
> > > don't.
> > 
> > Well for VFIO devices you are creating these mappings so we surely
> > can find a way for you to check that.
> > Doesn't each segment point back at the memory region that created it?
> > Then you can just check that.
> 
> It's a fairly heavy-weight search and it only avoids vfio devices, so it
> feels like it's just delaying a real solution.

Well there are several problems.

That a device gets its own BAR programmed
as a valid target in the IOMMU is, in my opinion, a separate bug,
and for *that* it's a real solution.
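
A minimal sketch of that narrower check, again with invented names and
stand-in types rather than real vfio code: when mapping on behalf of one
device, only that device's own BARs have to be recognised, so there is no
walk over every assigned device.

    #include <stdbool.h>
    #include <stdio.h>

    typedef struct MemoryRegion { int dummy; } MemoryRegion;   /* stand-in */

    /* Skip a section whose region is one of this device's own BARs. */
    static bool section_is_own_bar(MemoryRegion *own_bars[6],
                                   const MemoryRegion *mr)
    {
        int i;

        for (i = 0; i < 6; i++) {
            if (mr && own_bars[i] == mr) {
                return true;    /* would be DMA into the device's own BAR */
            }
        }
        return false;
    }

    int main(void)
    {
        static MemoryRegion bar0, ram;
        MemoryRegion *bars[6] = { &bar0 };

        /* prints "1 0" */
        printf("%d %d\n", section_is_own_bar(bars, &bar0),
               section_is_own_bar(bars, &ram));
        return 0;
    }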

> > >  Self-DMA is really not the intent of doing the mapping, but
> > > peer-to-peer does have merit.
> > > 
> > > > I also note that if someone tries zero copy transmit out of such an
> > > > address, get user pages will fail.
> > > > I think this means tun zero copy transmit needs to fall-back
> > > > on copy from user on get user pages failure.
> > > > 
> > > > Jason, what's your thinking on this?
> > > > 
> > > 
> > > 
> 
>
Michael S. Tsirkin Jan. 14, 2014, 4:20 p.m. UTC | #33
On Tue, Jan 14, 2014 at 08:36:27AM -0700, Alex Williamson wrote:
> On Tue, 2014-01-14 at 12:24 +0200, Avi Kivity wrote:
> > On 01/14/2014 12:48 AM, Alex Williamson wrote:
> > > On Mon, 2014-01-13 at 22:48 +0100, Alexander Graf wrote:
> > >>> Am 13.01.2014 um 22:39 schrieb Alex Williamson <alex.williamson@redhat.com>:
> > >>>
> > >>>> On Sun, 2014-01-12 at 16:03 +0100, Alexander Graf wrote:
> > >>>>> On 12.01.2014, at 08:54, Michael S. Tsirkin <mst@redhat.com> wrote:
> > >>>>>
> > >>>>>> On Fri, Jan 10, 2014 at 08:31:36AM -0700, Alex Williamson wrote:
> > >>>>>>> On Fri, 2014-01-10 at 14:55 +0200, Michael S. Tsirkin wrote:
> > >>>>>>>> On Thu, Jan 09, 2014 at 03:42:22PM -0700, Alex Williamson wrote:
> > >>>>>>>>> On Thu, 2014-01-09 at 23:56 +0200, Michael S. Tsirkin wrote:
> > >>>>>>>>>> On Thu, Jan 09, 2014 at 12:03:26PM -0700, Alex Williamson wrote:
> > >>>>>>>>>>> On Thu, 2014-01-09 at 11:47 -0700, Alex Williamson wrote:
> > >>>>>>>>>>>> On Thu, 2014-01-09 at 20:00 +0200, Michael S. Tsirkin wrote:
> > >>>>>>>>>>>>> On Thu, Jan 09, 2014 at 10:24:47AM -0700, Alex Williamson wrote:
> > >>>>>>>>>>>>>> On Wed, 2013-12-11 at 20:30 +0200, Michael S. Tsirkin wrote:
> > >>>>>>>>>>>>>> From: Paolo Bonzini <pbonzini@redhat.com>
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> As an alternative to commit 818f86b (exec: limit system memory
> > >>>>>>>>>>>>>> size, 2013-11-04) let's just make all address spaces 64-bit wide.
> > >>>>>>>>>>>>>> This eliminates problems with phys_page_find ignoring bits above
> > >>>>>>>>>>>>>> TARGET_PHYS_ADDR_SPACE_BITS and address_space_translate_internal
> > >>>>>>>>>>>>>> consequently messing up the computations.
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> In Luiz's reported crash, at startup gdb attempts to read from address
> > >>>>>>>>>>>>>> 0xffffffffffffffe6 to 0xffffffffffffffff inclusive.  The region it gets
> > >>>>>>>>>>>>>> is the newly introduced master abort region, which is as big as the PCI
> > >>>>>>>>>>>>>> address space (see pci_bus_init).  Due to a typo that's only 2^63-1,
> > >>>>>>>>>>>>>> not 2^64.  But we get it anyway because phys_page_find ignores the upper
> > >>>>>>>>>>>>>> bits of the physical address.  In address_space_translate_internal then
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>    diff = int128_sub(section->mr->size, int128_make64(addr));
> > >>>>>>>>>>>>>>    *plen = int128_get64(int128_min(diff, int128_make64(*plen)));
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> diff becomes negative, and int128_get64 booms.
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> The size of the PCI address space region should be fixed anyway.
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> Reported-by: Luiz Capitulino <lcapitulino@redhat.com>
> > >>>>>>>>>>>>>> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> > >>>>>>>>>>>>>> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> > >>>>>>>>>>>>>> ---
> > >>>>>>>>>>>>>> exec.c | 8 ++------
> > >>>>>>>>>>>>>> 1 file changed, 2 insertions(+), 6 deletions(-)
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> diff --git a/exec.c b/exec.c
> > >>>>>>>>>>>>>> index 7e5ce93..f907f5f 100644
> > >>>>>>>>>>>>>> --- a/exec.c
> > >>>>>>>>>>>>>> +++ b/exec.c
> > >>>>>>>>>>>>>> @@ -94,7 +94,7 @@ struct PhysPageEntry {
> > >>>>>>>>>>>>>> #define PHYS_MAP_NODE_NIL (((uint32_t)~0) >> 6)
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> /* Size of the L2 (and L3, etc) page tables.  */
> > >>>>>>>>>>>>>> -#define ADDR_SPACE_BITS TARGET_PHYS_ADDR_SPACE_BITS
> > >>>>>>>>>>>>>> +#define ADDR_SPACE_BITS 64
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> #define P_L2_BITS 10
> > >>>>>>>>>>>>>> #define P_L2_SIZE (1 << P_L2_BITS)
> > >>>>>>>>>>>>>> @@ -1861,11 +1861,7 @@ static void memory_map_init(void)
> > >>>>>>>>>>>>>> {
> > >>>>>>>>>>>>>>     system_memory = g_malloc(sizeof(*system_memory));
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> -    assert(ADDR_SPACE_BITS <= 64);
> > >>>>>>>>>>>>>> -
> > >>>>>>>>>>>>>> -    memory_region_init(system_memory, NULL, "system",
> > >>>>>>>>>>>>>> -                       ADDR_SPACE_BITS == 64 ?
> > >>>>>>>>>>>>>> -                       UINT64_MAX : (0x1ULL << ADDR_SPACE_BITS));
> > >>>>>>>>>>>>>> +    memory_region_init(system_memory, NULL, "system", UINT64_MAX);
> > >>>>>>>>>>>>>>     address_space_init(&address_space_memory, system_memory, "memory");
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>     system_io = g_malloc(sizeof(*system_io));
> > >>>>>>>>>>>>> This seems to have some unexpected consequences around sizing 64bit PCI
> > >>>>>>>>>>>>> BARs that I'm not sure how to handle.
> > >>>>>>>>>>>> BARs are often disabled during sizing. Maybe you
> > >>>>>>>>>>>> don't detect BAR being disabled?
> > >>>>>>>>>>> See the trace below, the BARs are not disabled.  QEMU pci-core is doing
> > >>>>>>>>>>> the sizing and memory region updates for the BARs, vfio is just a
> > >>>>>>>>>>> pass-through here.
> > >>>>>>>>>> Sorry, not in the trace below, but yes the sizing seems to be happening
> > >>>>>>>>>> while I/O & memory are enabled in the command register.  Thanks,
> > >>>>>>>>>>
> > >>>>>>>>>> Alex
> > >>>>>>>>> OK then from QEMU POV this BAR value is not special at all.
> > >>>>>>>> Unfortunately
> > >>>>>>>>
> > >>>>>>>>>>>>> After this patch I get vfio
> > >>>>>>>>>>>>> traces like this:
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> vfio: vfio_pci_read_config(0000:01:10.0, @0x10, len=0x4) febe0004
> > >>>>>>>>>>>>> (save lower 32bits of BAR)
> > >>>>>>>>>>>>> vfio: vfio_pci_write_config(0000:01:10.0, @0x10, 0xffffffff, len=0x4)
> > >>>>>>>>>>>>> (write mask to BAR)
> > >>>>>>>>>>>>> vfio: region_del febe0000 - febe3fff
> > >>>>>>>>>>>>> (memory region gets unmapped)
> > >>>>>>>>>>>>> vfio: vfio_pci_read_config(0000:01:10.0, @0x10, len=0x4) ffffc004
> > >>>>>>>>>>>>> (read size mask)
> > >>>>>>>>>>>>> vfio: vfio_pci_write_config(0000:01:10.0, @0x10, 0xfebe0004, len=0x4)
> > >>>>>>>>>>>>> (restore BAR)
> > >>>>>>>>>>>>> vfio: region_add febe0000 - febe3fff [0x7fcf3654d000]
> > >>>>>>>>>>>>> (memory region re-mapped)
> > >>>>>>>>>>>>> vfio: vfio_pci_read_config(0000:01:10.0, @0x14, len=0x4) 0
> > >>>>>>>>>>>>> (save upper 32bits of BAR)
> > >>>>>>>>>>>>> vfio: vfio_pci_write_config(0000:01:10.0, @0x14, 0xffffffff, len=0x4)
> > >>>>>>>>>>>>> (write mask to BAR)
> > >>>>>>>>>>>>> vfio: region_del febe0000 - febe3fff
> > >>>>>>>>>>>>> (memory region gets unmapped)
> > >>>>>>>>>>>>> vfio: region_add fffffffffebe0000 - fffffffffebe3fff [0x7fcf3654d000]
> > >>>>>>>>>>>>> (memory region gets re-mapped with new address)
> > >>>>>>>>>>>>> qemu-system-x86_64: vfio_dma_map(0x7fcf38861710, 0xfffffffffebe0000, 0x4000, 0x7fcf3654d000) = -14 (Bad address)
> > >>>>>>>>>>>>> (iommu barfs because it can only handle 48bit physical addresses)
> > >>>>>>>>>>>> Why are you trying to program BAR addresses for dma in the iommu?
> > >>>>>>>>>>> Two reasons, first I can't tell the difference between RAM and MMIO.
> > >>>>>>>>> Why can't you? Generally memory core let you find out easily.
> > >>>>>>>> My MemoryListener is setup for &address_space_memory and I then filter
> > >>>>>>>> out anything that's not memory_region_is_ram().  This still gets
> > >>>>>>>> through, so how do I easily find out?
> > >>>>>>>>
> > >>>>>>>>> But in this case it's vfio device itself that is sized so for sure you
> > >>>>>>>>> know it's MMIO.
> > >>>>>>>> How so?  I have a MemoryListener as described above and pass everything
> > >>>>>>>> through to the IOMMU.  I suppose I could look through all the
> > >>>>>>>> VFIODevices and check if the MemoryRegion matches, but that seems really
> > >>>>>>>> ugly.
> > >>>>>>>>
> > >>>>>>>>> Maybe you will have same issue if there's another device with a 64 bit
> > >>>>>>>>> bar though, like ivshmem?
> > >>>>>>>> Perhaps, I suspect I'll see anything that registers their BAR
> > >>>>>>>> MemoryRegion from memory_region_init_ram or memory_region_init_ram_ptr.
> > >>>>>>> Must be a 64 bit BAR to trigger the issue though.
> > >>>>>>>
> > >>>>>>>>>>> Second, it enables peer-to-peer DMA between devices, which is something
> > >>>>>>>>>>> that we might be able to take advantage of with GPU passthrough.
> > >>>>>>>>>>>
> > >>>>>>>>>>>>> Prior to this change, there was no re-map with the fffffffffebe0000
> > >>>>>>>>>>>>> address, presumably because it was beyond the address space of the PCI
> > >>>>>>>>>>>>> window.  This address is clearly not in a PCI MMIO space, so why are we
> > >>>>>>>>>>>>> allowing it to be realized in the system address space at this location?
> > >>>>>>>>>>>>> Thanks,
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> Alex
> > >>>>>>>>>>>> Why do you think it is not in PCI MMIO space?
> > >>>>>>>>>>>> True, CPU can't access this address but other pci devices can.
> > >>>>>>>>>>> What happens on real hardware when an address like this is programmed to
> > >>>>>>>>>>> a device?  The CPU doesn't have the physical bits to access it.  I have
> > >>>>>>>>>>> serious doubts that another PCI device would be able to access it
> > >>>>>>>>>>> either.  Maybe in some limited scenario where the devices are on the
> > >>>>>>>>>>> same conventional PCI bus.  In the typical case, PCI addresses are
> > >>>>>>>>>>> always limited by some kind of aperture, whether that's explicit in
> > >>>>>>>>>>> bridge windows or implicit in hardware design (and perhaps made explicit
> > >>>>>>>>>>> in ACPI).  Even if I wanted to filter these out as noise in vfio, how
> > >>>>>>>>>>> would I do it in a way that still allows real 64bit MMIO to be
> > >>>>>>>>>>> programmed.  PCI has this knowledge, I hope.  VFIO doesn't.  Thanks,
> > >>>>>>>>>>>
> > >>>>>>>>>>> Alex
> > >>>>>>>>> AFAIK PCI doesn't have that knowledge as such. PCI spec is explicit that
> > >>>>>>>>> full 64 bit addresses must be allowed and hardware validation
> > >>>>>>>>> test suites normally check that it actually does work
> > >>>>>>>>> if it happens.
> > >>>>>>>> Sure, PCI devices themselves, but the chipset typically has defined
> > >>>>>>>> routing, that's more what I'm referring to.  There are generally only
> > >>>>>>>> fixed address windows for RAM vs MMIO.
> > >>>>>>> The physical chipset? Likely - in the presence of IOMMU.
> > >>>>>>> Without that, devices can talk to each other without going
> > >>>>>>> through chipset, and bridge spec is very explicit that
> > >>>>>>> full 64 bit addressing must be supported.
> > >>>>>>>
> > >>>>>>> So as long as we don't emulate an IOMMU,
> > >>>>>>> guest will normally think it's okay to use any address.
> > >>>>>>>
> > >>>>>>>>> Yes, if there's a bridge somewhere on the path that bridge's
> > >>>>>>>>> windows would protect you, but pci already does this filtering:
> > >>>>>>>>> if you see this address in the memory map this means
> > >>>>>>>>> your virtual device is on root bus.
> > >>>>>>>>>
> > >>>>>>>>> So I think it's the other way around: if VFIO requires specific
> > >>>>>>>>> address ranges to be assigned to devices, it should give this
> > >>>>>>>>> info to qemu and qemu can give this to guest.
> > >>>>>>>>> Then anything outside that range can be ignored by VFIO.
> > >>>>>>>> Then we get into deficiencies in the IOMMU API and maybe VFIO.  There's
> > >>>>>>>> currently no way to find out the address width of the IOMMU.  We've been
> > >>>>>>>> getting by because it's safely close enough to the CPU address width to
> > >>>>>>>> not be a concern until we start exposing things at the top of the 64bit
> > >>>>>>>> address space.  Maybe I can safely ignore anything above
> > >>>>>>>> TARGET_PHYS_ADDR_SPACE_BITS for now.  Thanks,
> > >>>>>>>>
> > >>>>>>>> Alex
> > >>>>>>> I think it's not related to target CPU at all - it's a host limitation.
> > >>>>>>> So just make up your own constant, maybe depending on host architecture.
> > >>>>>>> Long term add an ioctl to query it.
> > >>>>>> It's a hardware limitation which I'd imagine has some loose ties to the
> > >>>>>> physical address bits of the CPU.
> > >>>>>>
> > >>>>>>> Also, we can add a fwcfg interface to tell bios that it should avoid
> > >>>>>>> placing BARs above some address.
> > >>>>>> That doesn't help this case, it's a spurious mapping caused by sizing
> > >>>>>> the BARs with them enabled.  We may still want such a thing to feed into
> > >>>>>> building ACPI tables though.
> > >>>>> Well the point is that if you want BIOS to avoid
> > >>>>> specific addresses, you need to tell it what to avoid.
> > >>>>> But neither BIOS nor ACPI actually cover the range above
> > >>>>> 2^48 ATM so it's not a high priority.
> > >>>>>
> > >>>>>>> Since it's a vfio limitation I think it should be a vfio API, along the
> > >>>>>>> lines of vfio_get_addr_space_bits(void).
> > >>>>>>> (Is this true btw? legacy assignment doesn't have this problem?)
> > >>>>>> It's an IOMMU hardware limitation, legacy assignment has the same
> > >>>>>> problem.  It looks like legacy will abort() in QEMU for the failed
> > >>>>>> mapping and I'm planning to tighten vfio to also kill the VM for failed
> > >>>>>> mappings.  In the short term, I think I'll ignore any mappings above
> > >>>>>> TARGET_PHYS_ADDR_SPACE_BITS,
> > >>>>> That seems very wrong. It will still fail on an x86 host if we are
> > >>>>> emulating a CPU with full 64 bit addressing. The limitation is on the
> > >>>>> host side; there's no real reason to tie it to the target.
> > >>> I doubt vfio would be the only thing broken in that case.
> > >>>
> > >>>>>> long term vfio already has an IOMMU info
> > >>>>>> ioctl that we could use to return this information, but we'll need to
> > >>>>>> figure out how to get it out of the IOMMU driver first.
> > >>>>>> Thanks,
> > >>>>>>
> > >>>>>> Alex
> > >>>>> Short term, just assume 48 bits on x86.
> > >>> I hate to pick an arbitrary value since we have a very specific mapping
> > >>> we're trying to avoid.  Perhaps a better option is to skip anything
> > >>> where:
> > >>>
> > >>>         MemoryRegionSection.offset_within_address_space >
> > >>>         ~MemoryRegionSection.offset_within_address_space
> > >>>
> > >>>>> We need to figure out what's the limitation on ppc and arm -
> > >>>>> maybe there's none and it can address full 64 bit range.
> > >>>> IIUC on PPC and ARM you always have BAR windows where things can get mapped into. Unlike x86 where the full physical address range can be overlaid by BARs.
> > >>>>
> > >>>> Or did I misunderstand the question?
> > >>> Sounds right, if either BAR mappings outside the window will not be
> > >>> realized in the memory space or the IOMMU has a full 64bit address
> > >>> space, there's no problem.  Here we have an intermediate step in the BAR
> > >>> sizing producing a stray mapping that the IOMMU hardware can't handle.
> > >>> Even if we could handle it, it's not clear that we want to.  On AMD-Vi
> > >>> the IOMMU page tables can grow to 6 levels deep.  A stray mapping like
> > >>> this then causes space and time overhead until the tables are pruned
> > >>> back down.  Thanks,
> > >> I thought sizing is hard defined as a set to
> > >> -1? Can't we check for that one special case and treat it as "not mapped, but tell the guest the size in config space"?
> > > PCI doesn't want to handle this as anything special to differentiate a
> > > sizing mask from a valid BAR address.  I agree though, I'd prefer to
> > > never see a spurious address like this in my MemoryListener.
> > >
> > >
> > 
> > Can't you just ignore regions that cannot be mapped?  Oh, and teach the 
> > bios and/or linux to disable memory access while sizing.
> 
> Actually I think we need to be more stringent about DMA mapping
> failures.  If a chunk of guest RAM fails to map then we can lose data if
> the device attempts to DMA a packet into it.  How do we know which
> regions we can ignore and which we can't?  Whether or not the CPU can
> access it is a pretty good hint that we can ignore it.  Thanks,
> 
> Alex

Go ahead and use that as a hint if you prefer, but for targets with more
CPU physical address bits than the host IOMMU supports, this might not be
enough to actually keep things from breaking.
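
To put numbers on that point, a stand-alone sketch with example values only
(a target advertising more physical address bits than the host IOMMU handles;
neither constant is taken from real hardware or real QEMU macros):

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    #define TARGET_PHYS_ADDR_BITS 52    /* hypothetical wide target */
    #define HOST_IOMMU_BITS       48    /* e.g. a 48-bit host IOMMU */

    static bool cpu_hint_accepts(uint64_t addr)
    {
        return addr < (1ULL << TARGET_PHYS_ADDR_BITS);
    }

    static bool host_iommu_accepts(uint64_t addr)
    {
        return addr < (1ULL << HOST_IOMMU_BITS);
    }

    int main(void)
    {
        uint64_t addr = 1ULL << 50;   /* fine for the target, not for the host */

        /* prints "cpu hint: 1, host iommu: 0": the hint alone lets it through */
        printf("cpu hint: %d, host iommu: %d\n",
               cpu_hint_accepts(addr), host_iommu_accepts(addr));
        return 0;
    }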
Alex Williamson Jan. 14, 2014, 4:39 p.m. UTC | #34
On Tue, 2014-01-14 at 18:18 +0200, Michael S. Tsirkin wrote:
> On Tue, Jan 14, 2014 at 09:15:14AM -0700, Alex Williamson wrote:
> > On Tue, 2014-01-14 at 18:03 +0200, Michael S. Tsirkin wrote:
> > > On Tue, Jan 14, 2014 at 08:57:58AM -0700, Alex Williamson wrote:
> > > > On Tue, 2014-01-14 at 14:07 +0200, Michael S. Tsirkin wrote:
> > > > > On Mon, Jan 13, 2014 at 03:48:11PM -0700, Alex Williamson wrote:
> > > > > > On Mon, 2014-01-13 at 22:48 +0100, Alexander Graf wrote:
> > > > > > > 
> > > > > > > > Am 13.01.2014 um 22:39 schrieb Alex Williamson <alex.williamson@redhat.com>:
> > > > > > > > 
> > > > > > > >> On Sun, 2014-01-12 at 16:03 +0100, Alexander Graf wrote:
> > > > > > > >>> On 12.01.2014, at 08:54, Michael S. Tsirkin <mst@redhat.com> wrote:
> > > > > > > >>> 
> > > > > > > >>>> On Fri, Jan 10, 2014 at 08:31:36AM -0700, Alex Williamson wrote:
> > > > > > > >>>>> On Fri, 2014-01-10 at 14:55 +0200, Michael S. Tsirkin wrote:
> > > > > > > >>>>>> On Thu, Jan 09, 2014 at 03:42:22PM -0700, Alex Williamson wrote:
> > > > > > > >>>>>>> On Thu, 2014-01-09 at 23:56 +0200, Michael S. Tsirkin wrote:
> > > > > > > >>>>>>>> On Thu, Jan 09, 2014 at 12:03:26PM -0700, Alex Williamson wrote:
> > > > > > > >>>>>>>>> On Thu, 2014-01-09 at 11:47 -0700, Alex Williamson wrote:
> > > > > > > >>>>>>>>>> On Thu, 2014-01-09 at 20:00 +0200, Michael S. Tsirkin wrote:
> > > > > > > >>>>>>>>>>> On Thu, Jan 09, 2014 at 10:24:47AM -0700, Alex Williamson wrote:
> > > > > > > >>>>>>>>>>>> On Wed, 2013-12-11 at 20:30 +0200, Michael S. Tsirkin wrote:
> > > > > > > >>>>>>>>>>>> From: Paolo Bonzini <pbonzini@redhat.com>
> > > > > > > >>>>>>>>>>>> 
> > > > > > > >>>>>>>>>>>> As an alternative to commit 818f86b (exec: limit system memory
> > > > > > > >>>>>>>>>>>> size, 2013-11-04) let's just make all address spaces 64-bit wide.
> > > > > > > >>>>>>>>>>>> This eliminates problems with phys_page_find ignoring bits above
> > > > > > > >>>>>>>>>>>> TARGET_PHYS_ADDR_SPACE_BITS and address_space_translate_internal
> > > > > > > >>>>>>>>>>>> consequently messing up the computations.
> > > > > > > >>>>>>>>>>>> 
> > > > > > > >>>>>>>>>>>> In Luiz's reported crash, at startup gdb attempts to read from address
> > > > > > > >>>>>>>>>>>> 0xffffffffffffffe6 to 0xffffffffffffffff inclusive.  The region it gets
> > > > > > > >>>>>>>>>>>> is the newly introduced master abort region, which is as big as the PCI
> > > > > > > >>>>>>>>>>>> address space (see pci_bus_init).  Due to a typo that's only 2^63-1,
> > > > > > > >>>>>>>>>>>> not 2^64.  But we get it anyway because phys_page_find ignores the upper
> > > > > > > >>>>>>>>>>>> bits of the physical address.  In address_space_translate_internal then
> > > > > > > >>>>>>>>>>>> 
> > > > > > > >>>>>>>>>>>>   diff = int128_sub(section->mr->size, int128_make64(addr));
> > > > > > > >>>>>>>>>>>>   *plen = int128_get64(int128_min(diff, int128_make64(*plen)));
> > > > > > > >>>>>>>>>>>> 
> > > > > > > >>>>>>>>>>>> diff becomes negative, and int128_get64 booms.
> > > > > > > >>>>>>>>>>>> 
> > > > > > > >>>>>>>>>>>> The size of the PCI address space region should be fixed anyway.
> > > > > > > >>>>>>>>>>>> 
> > > > > > > >>>>>>>>>>>> Reported-by: Luiz Capitulino <lcapitulino@redhat.com>
> > > > > > > >>>>>>>>>>>> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> > > > > > > >>>>>>>>>>>> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> > > > > > > >>>>>>>>>>>> ---
> > > > > > > >>>>>>>>>>>> exec.c | 8 ++------
> > > > > > > >>>>>>>>>>>> 1 file changed, 2 insertions(+), 6 deletions(-)
> > > > > > > >>>>>>>>>>>> 
> > > > > > > >>>>>>>>>>>> diff --git a/exec.c b/exec.c
> > > > > > > >>>>>>>>>>>> index 7e5ce93..f907f5f 100644
> > > > > > > >>>>>>>>>>>> --- a/exec.c
> > > > > > > >>>>>>>>>>>> +++ b/exec.c
> > > > > > > >>>>>>>>>>>> @@ -94,7 +94,7 @@ struct PhysPageEntry {
> > > > > > > >>>>>>>>>>>> #define PHYS_MAP_NODE_NIL (((uint32_t)~0) >> 6)
> > > > > > > >>>>>>>>>>>> 
> > > > > > > >>>>>>>>>>>> /* Size of the L2 (and L3, etc) page tables.  */
> > > > > > > >>>>>>>>>>>> -#define ADDR_SPACE_BITS TARGET_PHYS_ADDR_SPACE_BITS
> > > > > > > >>>>>>>>>>>> +#define ADDR_SPACE_BITS 64
> > > > > > > >>>>>>>>>>>> 
> > > > > > > >>>>>>>>>>>> #define P_L2_BITS 10
> > > > > > > >>>>>>>>>>>> #define P_L2_SIZE (1 << P_L2_BITS)
> > > > > > > >>>>>>>>>>>> @@ -1861,11 +1861,7 @@ static void memory_map_init(void)
> > > > > > > >>>>>>>>>>>> {
> > > > > > > >>>>>>>>>>>>    system_memory = g_malloc(sizeof(*system_memory));
> > > > > > > >>>>>>>>>>>> 
> > > > > > > >>>>>>>>>>>> -    assert(ADDR_SPACE_BITS <= 64);
> > > > > > > >>>>>>>>>>>> -
> > > > > > > >>>>>>>>>>>> -    memory_region_init(system_memory, NULL, "system",
> > > > > > > >>>>>>>>>>>> -                       ADDR_SPACE_BITS == 64 ?
> > > > > > > >>>>>>>>>>>> -                       UINT64_MAX : (0x1ULL << ADDR_SPACE_BITS));
> > > > > > > >>>>>>>>>>>> +    memory_region_init(system_memory, NULL, "system", UINT64_MAX);
> > > > > > > >>>>>>>>>>>>    address_space_init(&address_space_memory, system_memory, "memory");
> > > > > > > >>>>>>>>>>>> 
> > > > > > > >>>>>>>>>>>>    system_io = g_malloc(sizeof(*system_io));
> > > > > > > >>>>>>>>>>> 
> > > > > > > >>>>>>>>>>> This seems to have some unexpected consequences around sizing 64bit PCI
> > > > > > > >>>>>>>>>>> BARs that I'm not sure how to handle.
> > > > > > > >>>>>>>>>> 
> > > > > > > >>>>>>>>>> BARs are often disabled during sizing. Maybe you
> > > > > > > >>>>>>>>>> don't detect BAR being disabled?
> > > > > > > >>>>>>>>> 
> > > > > > > >>>>>>>>> See the trace below, the BARs are not disabled.  QEMU pci-core is doing
> > > > > > > >>>>>>>>> the sizing and memory region updates for the BARs, vfio is just a
> > > > > > > >>>>>>>>> pass-through here.
> > > > > > > >>>>>>>> 
> > > > > > > >>>>>>>> Sorry, not in the trace below, but yes the sizing seems to be happening
> > > > > > > >>>>>>>> while I/O & memory are enabled in the command register.  Thanks,
> > > > > > > >>>>>>>> 
> > > > > > > >>>>>>>> Alex
> > > > > > > >>>>>>> 
> > > > > > > >>>>>>> OK then from QEMU POV this BAR value is not special at all.
> > > > > > > >>>>>> 
> > > > > > > >>>>>> Unfortunately
> > > > > > > >>>>>> 
> > > > > > > >>>>>>>>>>> After this patch I get vfio
> > > > > > > >>>>>>>>>>> traces like this:
> > > > > > > >>>>>>>>>>> 
> > > > > > > >>>>>>>>>>> vfio: vfio_pci_read_config(0000:01:10.0, @0x10, len=0x4) febe0004
> > > > > > > >>>>>>>>>>> (save lower 32bits of BAR)
> > > > > > > >>>>>>>>>>> vfio: vfio_pci_write_config(0000:01:10.0, @0x10, 0xffffffff, len=0x4)
> > > > > > > >>>>>>>>>>> (write mask to BAR)
> > > > > > > >>>>>>>>>>> vfio: region_del febe0000 - febe3fff
> > > > > > > >>>>>>>>>>> (memory region gets unmapped)
> > > > > > > >>>>>>>>>>> vfio: vfio_pci_read_config(0000:01:10.0, @0x10, len=0x4) ffffc004
> > > > > > > >>>>>>>>>>> (read size mask)
> > > > > > > >>>>>>>>>>> vfio: vfio_pci_write_config(0000:01:10.0, @0x10, 0xfebe0004, len=0x4)
> > > > > > > >>>>>>>>>>> (restore BAR)
> > > > > > > >>>>>>>>>>> vfio: region_add febe0000 - febe3fff [0x7fcf3654d000]
> > > > > > > >>>>>>>>>>> (memory region re-mapped)
> > > > > > > >>>>>>>>>>> vfio: vfio_pci_read_config(0000:01:10.0, @0x14, len=0x4) 0
> > > > > > > >>>>>>>>>>> (save upper 32bits of BAR)
> > > > > > > >>>>>>>>>>> vfio: vfio_pci_write_config(0000:01:10.0, @0x14, 0xffffffff, len=0x4)
> > > > > > > >>>>>>>>>>> (write mask to BAR)
> > > > > > > >>>>>>>>>>> vfio: region_del febe0000 - febe3fff
> > > > > > > >>>>>>>>>>> (memory region gets unmapped)
> > > > > > > >>>>>>>>>>> vfio: region_add fffffffffebe0000 - fffffffffebe3fff [0x7fcf3654d000]
> > > > > > > >>>>>>>>>>> (memory region gets re-mapped with new address)
> > > > > > > >>>>>>>>>>> qemu-system-x86_64: vfio_dma_map(0x7fcf38861710, 0xfffffffffebe0000, 0x4000, 0x7fcf3654d000) = -14 (Bad address)
> > > > > > > >>>>>>>>>>> (iommu barfs because it can only handle 48bit physical addresses)
> > > > > > > >>>>>>>>>> 
> > > > > > > >>>>>>>>>> Why are you trying to program BAR addresses for dma in the iommu?
> > > > > > > >>>>>>>>> 
> > > > > > > >>>>>>>>> Two reasons, first I can't tell the difference between RAM and MMIO.
> > > > > > > >>>>>>> 
> > > > > > > >>>>>>> Why can't you? Generally the memory core lets you find out easily.
> > > > > > > >>>>>> 
> > > > > > > >>>>>> My MemoryListener is setup for &address_space_memory and I then filter
> > > > > > > >>>>>> out anything that's not memory_region_is_ram().  This still gets
> > > > > > > >>>>>> through, so how do I easily find out?
> > > > > > > >>>>>> 
> > > > > > > >>>>>>> But in this case it's vfio device itself that is sized so for sure you
> > > > > > > >>>>>>> know it's MMIO.
> > > > > > > >>>>>> 
> > > > > > > >>>>>> How so?  I have a MemoryListener as described above and pass everything
> > > > > > > >>>>>> through to the IOMMU.  I suppose I could look through all the
> > > > > > > >>>>>> VFIODevices and check if the MemoryRegion matches, but that seems really
> > > > > > > >>>>>> ugly.
> > > > > > > >>>>>> 
> > > > > > > >>>>>>> Maybe you will have same issue if there's another device with a 64 bit
> > > > > > > >>>>>>> bar though, like ivshmem?
> > > > > > > >>>>>> 
> > > > > > > >>>>>> Perhaps, I suspect I'll see anything that registers their BAR
> > > > > > > >>>>>> MemoryRegion from memory_region_init_ram or memory_region_init_ram_ptr.
> > > > > > > >>>>> 
> > > > > > > >>>>> Must be a 64 bit BAR to trigger the issue though.
> > > > > > > >>>>> 
> > > > > > > >>>>>>>>> Second, it enables peer-to-peer DMA between devices, which is something
> > > > > > > >>>>>>>>> that we might be able to take advantage of with GPU passthrough.
> > > > > > > >>>>>>>>> 
> > > > > > > >>>>>>>>>>> Prior to this change, there was no re-map with the fffffffffebe0000
> > > > > > > >>>>>>>>>>> address, presumably because it was beyond the address space of the PCI
> > > > > > > >>>>>>>>>>> window.  This address is clearly not in a PCI MMIO space, so why are we
> > > > > > > >>>>>>>>>>> allowing it to be realized in the system address space at this location?
> > > > > > > >>>>>>>>>>> Thanks,
> > > > > > > >>>>>>>>>>> 
> > > > > > > >>>>>>>>>>> Alex
> > > > > > > >>>>>>>>>> 
> > > > > > > >>>>>>>>>> Why do you think it is not in PCI MMIO space?
> > > > > > > >>>>>>>>>> True, CPU can't access this address but other pci devices can.
> > > > > > > >>>>>>>>> 
> > > > > > > >>>>>>>>> What happens on real hardware when an address like this is programmed to
> > > > > > > >>>>>>>>> a device?  The CPU doesn't have the physical bits to access it.  I have
> > > > > > > >>>>>>>>> serious doubts that another PCI device would be able to access it
> > > > > > > >>>>>>>>> either.  Maybe in some limited scenario where the devices are on the
> > > > > > > >>>>>>>>> same conventional PCI bus.  In the typical case, PCI addresses are
> > > > > > > >>>>>>>>> always limited by some kind of aperture, whether that's explicit in
> > > > > > > >>>>>>>>> bridge windows or implicit in hardware design (and perhaps made explicit
> > > > > > > >>>>>>>>> in ACPI).  Even if I wanted to filter these out as noise in vfio, how
> > > > > > > >>>>>>>>> would I do it in a way that still allows real 64bit MMIO to be
> > > > > > > >>>>>>>>> programmed.  PCI has this knowledge, I hope.  VFIO doesn't.  Thanks,
> > > > > > > >>>>>>>>> 
> > > > > > > >>>>>>>>> Alex
> > > > > > > >>>>>>> 
> > > > > > > >>>>>>> AFAIK PCI doesn't have that knowledge as such. PCI spec is explicit that
> > > > > > > >>>>>>> full 64 bit addresses must be allowed and hardware validation
> > > > > > > >>>>>>> test suites normally check that it actually does work
> > > > > > > >>>>>>> if it happens.
> > > > > > > >>>>>> 
> > > > > > > >>>>>> Sure, PCI devices themselves, but the chipset typically has defined
> > > > > > > >>>>>> routing, that's more what I'm referring to.  There are generally only
> > > > > > > >>>>>> fixed address windows for RAM vs MMIO.
> > > > > > > >>>>> 
> > > > > > > >>>>> The physical chipset? Likely - in the presence of IOMMU.
> > > > > > > >>>>> Without that, devices can talk to each other without going
> > > > > > > >>>>> through chipset, and bridge spec is very explicit that
> > > > > > > >>>>> full 64 bit addressing must be supported.
> > > > > > > >>>>> 
> > > > > > > >>>>> So as long as we don't emulate an IOMMU,
> > > > > > > >>>>> guest will normally think it's okay to use any address.
> > > > > > > >>>>> 
> > > > > > > >>>>>>> Yes, if there's a bridge somewhere on the path that bridge's
> > > > > > > >>>>>>> windows would protect you, but pci already does this filtering:
> > > > > > > >>>>>>> if you see this address in the memory map this means
> > > > > > > >>>>>>> your virtual device is on root bus.
> > > > > > > >>>>>>> 
> > > > > > > >>>>>>> So I think it's the other way around: if VFIO requires specific
> > > > > > > >>>>>>> address ranges to be assigned to devices, it should give this
> > > > > > > >>>>>>> info to qemu and qemu can give this to guest.
> > > > > > > >>>>>>> Then anything outside that range can be ignored by VFIO.
> > > > > > > >>>>>> 
> > > > > > > >>>>>> Then we get into deficiencies in the IOMMU API and maybe VFIO.  There's
> > > > > > > >>>>>> currently no way to find out the address width of the IOMMU.  We've been
> > > > > > > >>>>>> getting by because it's safely close enough to the CPU address width to
> > > > > > > >>>>>> not be a concern until we start exposing things at the top of the 64bit
> > > > > > > >>>>>> address space.  Maybe I can safely ignore anything above
> > > > > > > >>>>>> TARGET_PHYS_ADDR_SPACE_BITS for now.  Thanks,
> > > > > > > >>>>>> 
> > > > > > > >>>>>> Alex
> > > > > > > >>>>> 
> > > > > > > >>>>> I think it's not related to target CPU at all - it's a host limitation.
> > > > > > > >>>>> So just make up your own constant, maybe depending on host architecture.
> > > > > > > >>>>> Long term add an ioctl to query it.
> > > > > > > >>>> 
> > > > > > > >>>> It's a hardware limitation which I'd imagine has some loose ties to the
> > > > > > > >>>> physical address bits of the CPU.
> > > > > > > >>>> 
> > > > > > > >>>>> Also, we can add a fwcfg interface to tell bios that it should avoid
> > > > > > > >>>>> placing BARs above some address.
> > > > > > > >>>> 
> > > > > > > >>>> That doesn't help this case, it's a spurious mapping caused by sizing
> > > > > > > >>>> the BARs with them enabled.  We may still want such a thing to feed into
> > > > > > > >>>> building ACPI tables though.
> > > > > > > >>> 
> > > > > > > >>> Well the point is that if you want BIOS to avoid
> > > > > > > >>> specific addresses, you need to tell it what to avoid.
> > > > > > > >>> But neither BIOS nor ACPI actually cover the range above
> > > > > > > >>> 2^48 ATM so it's not a high priority.
> > > > > > > >>> 
> > > > > > > >>>>> Since it's a vfio limitation I think it should be a vfio API, along the
> > > > > > > >>>>> lines of vfio_get_addr_space_bits(void).
> > > > > > > >>>>> (Is this true btw? legacy assignment doesn't have this problem?)
> > > > > > > >>>> 
> > > > > > > >>>> It's an IOMMU hardware limitation, legacy assignment has the same
> > > > > > > >>>> problem.  It looks like legacy will abort() in QEMU for the failed
> > > > > > > >>>> mapping and I'm planning to tighten vfio to also kill the VM for failed
> > > > > > > >>>> mappings.  In the short term, I think I'll ignore any mappings above
> > > > > > > >>>> TARGET_PHYS_ADDR_SPACE_BITS,
> > > > > > > >>> 
> > > > > > > >>> That seems very wrong. It will still fail on an x86 host if we are
> > > > > > > >>> emulating a CPU with full 64 bit addressing. The limitation is on the
> > > > > > > >>> host side; there's no real reason to tie it to the target.
> > > > > > > > 
> > > > > > > > I doubt vfio would be the only thing broken in that case.
> > > > > > > > 
> > > > > > > >>>> long term vfio already has an IOMMU info
> > > > > > > >>>> ioctl that we could use to return this information, but we'll need to
> > > > > > > >>>> figure out how to get it out of the IOMMU driver first.
> > > > > > > >>>> Thanks,
> > > > > > > >>>> 
> > > > > > > >>>> Alex
> > > > > > > >>> 
> > > > > > > >>> Short term, just assume 48 bits on x86.
> > > > > > > > 
> > > > > > > > I hate to pick an arbitrary value since we have a very specific mapping
> > > > > > > > we're trying to avoid.  Perhaps a better option is to skip anything
> > > > > > > > where:
> > > > > > > > 
> > > > > > > >        MemoryRegionSection.offset_within_address_space >
> > > > > > > >        ~MemoryRegionSection.offset_within_address_space
> > > > > > > > 
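A minimal sketch of the filter being discussed above, written as a QEMU MemoryListener region_add callback of that era; the function names and the exact policy (skip non-RAM sections, skip sections whose offset lies in the top half of the 64-bit address space) are illustrative assumptions, not the actual vfio code:

#include <stdbool.h>
#include "exec/memory.h"    /* MemoryListener, MemoryRegionSection */

/* Hypothetical helper: decide whether a section should be mapped through
 * the IOMMU.  Skips anything that is not registered as RAM and anything
 * living in the top half of the address space, e.g. the transient
 * BAR-sizing mapping at 0xfffffffffebe0000 seen in the trace. */
static bool hypothetical_skip_section(MemoryRegionSection *section)
{
    if (!memory_region_is_ram(section->mr)) {
        return true;                      /* plain MMIO, ROM devices, ... */
    }
    if (section->offset_within_address_space >
        ~section->offset_within_address_space) {
        return true;                      /* offset has bit 63 set */
    }
    return false;
}

static void hypothetical_region_add(MemoryListener *listener,
                                    MemoryRegionSection *section)
{
    if (hypothetical_skip_section(section)) {
        return;
    }
    /* ... map the section into the IOMMU domain here ... */
}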
> > > > > > > >>> We need to figure out what's the limitation on ppc and arm -
> > > > > > > >>> maybe there's none and it can address full 64 bit range.
> > > > > > > >> 
> > > > > > > >> IIUC on PPC and ARM you always have BAR windows where things can get mapped into, unlike x86 where the full physical address range can be overlaid by BARs.
> > > > > > > >> 
> > > > > > > >> Or did I misunderstand the question?
> > > > > > > > 
> > > > > > > > Sounds right, if either BAR mappings outside the window will not be
> > > > > > > > realized in the memory space or the IOMMU has a full 64bit address
> > > > > > > > space, there's no problem.  Here we have an intermediate step in the BAR
> > > > > > > > sizing producing a stray mapping that the IOMMU hardware can't handle.
> > > > > > > > Even if we could handle it, it's not clear that we want to.  On AMD-Vi
> > > > > > > > the IOMMU page tables can grow to 6 levels deep.  A stray mapping like
> > > > > > > > this then causes space and time overhead until the tables are pruned
> > > > > > > > back down.  Thanks,
> > > > > > > 
> > > > > > > I thought sizing is hard-defined as setting the BAR to
> > > > > > > -1? Can't we check for that one special case and treat it as "not mapped, but tell the guest the size in config space"?
> > > > > > 
> > > > > > PCI doesn't want to handle this as anything special to differentiate a
> > > > > > sizing mask from a valid BAR address.  I agree though, I'd prefer to
> > > > > > never see a spurious address like this in my MemoryListener.
> > > > > 
> > > > > It's more a can't than doesn't want to: it's a 64 bit BAR, it's not
> > > > > set to all ones atomically.
> > > > > 
> > > > > Also, while it doesn't address this fully (same issue can happen
> > > > > e.g. with ivshmem), do you think we should distinguish these BARs mapped
> > > > > from vfio / device assignment in qemu somehow?
> > > > > 
> > > > > In particular, even when it has sane addresses:
> > > > > device really can not DMA into its own BAR, that's a spec violation
> > > > > so in theory can do anything including crashing the system.
> > > > > I don't know what happens in practice but
> > > > > if you are programming IOMMU to forward transactions back to
> > > > > device that originated it, you are not doing it any favors.
> > > > 
> > > > I might concede that peer-to-peer is more trouble than it's worth if I
> > > > had a convenient way to ignore MMIO mappings in my MemoryListener, but I
> > > > don't.
> > > 
> > > Well for VFIO devices you are creating these mappings so we surely
> > > can find a way for you to check that.
> > > Doesn't each segment point back at the memory region that created it?
> > > Then you can just check that.
> > 
> > It's a fairly heavy-weight search and it only avoids vfio devices, so it
> > feels like it's just delaying a real solution.
> 
> Well there are several problems.
> 
> That a device gets its own BAR programmed
> as a valid target in the IOMMU is in my opinion a separate bug,
> and for *that* it's a real solution.

Except the side-effect of that solution is that it also disables
peer-to-peer since we do not use separate IOMMU domains per device.  In
fact, we can't guarantee that it's possible to use separate IOMMU
domains per device.  So, the cure is worse than the disease.

> > > >  Self-DMA is really not the intent of doing the mapping, but
> > > > peer-to-peer does have merit.
> > > > 
> > > > > I also note that if someone tries zero copy transmit out of such an
> > > > > address, get user pages will fail.
> > > > > I think this means tun zero copy transmit needs to fall back
> > > > > on copy from user on get user pages failure.
> > > > > 
> > > > > Jason, what's your thinking on this?
> > > > > 
> > > > 
> > > > 
> > 
> >
Michael S. Tsirkin Jan. 14, 2014, 4:45 p.m. UTC | #35
On Tue, Jan 14, 2014 at 09:39:24AM -0700, Alex Williamson wrote:
> On Tue, 2014-01-14 at 18:18 +0200, Michael S. Tsirkin wrote:
> > On Tue, Jan 14, 2014 at 09:15:14AM -0700, Alex Williamson wrote:
> > > On Tue, 2014-01-14 at 18:03 +0200, Michael S. Tsirkin wrote:
> > > > On Tue, Jan 14, 2014 at 08:57:58AM -0700, Alex Williamson wrote:
> > > > > On Tue, 2014-01-14 at 14:07 +0200, Michael S. Tsirkin wrote:
> > > > > > On Mon, Jan 13, 2014 at 03:48:11PM -0700, Alex Williamson wrote:
> > > > > > > On Mon, 2014-01-13 at 22:48 +0100, Alexander Graf wrote:
> > > > > > > > 
> > > > > > > > > On 13.01.2014 at 22:39, Alex Williamson <alex.williamson@redhat.com> wrote:
> > > > > > > > > 
> > > > > > > > >> On Sun, 2014-01-12 at 16:03 +0100, Alexander Graf wrote:
> > > > > > > > >>> On 12.01.2014, at 08:54, Michael S. Tsirkin <mst@redhat.com> wrote:
> > > > > > > > >>> 
> > > > > > > > >>>> On Fri, Jan 10, 2014 at 08:31:36AM -0700, Alex Williamson wrote:
> > > > > > > > >>>>> On Fri, 2014-01-10 at 14:55 +0200, Michael S. Tsirkin wrote:
> > > > > > > > >>>>>> On Thu, Jan 09, 2014 at 03:42:22PM -0700, Alex Williamson wrote:
> > > > > > > > >>>>>>> On Thu, 2014-01-09 at 23:56 +0200, Michael S. Tsirkin wrote:
> > > > > > > > >>>>>>>> On Thu, Jan 09, 2014 at 12:03:26PM -0700, Alex Williamson wrote:
> > > > > > > > >>>>>>>>> On Thu, 2014-01-09 at 11:47 -0700, Alex Williamson wrote:
> > > > > > > > >>>>>>>>>> On Thu, 2014-01-09 at 20:00 +0200, Michael S. Tsirkin wrote:
> > > > > > > > >>>>>>>>>>> On Thu, Jan 09, 2014 at 10:24:47AM -0700, Alex Williamson wrote:
> > > > > > > > >>>>>>>>>>>> On Wed, 2013-12-11 at 20:30 +0200, Michael S. Tsirkin wrote:
> > > > > > > > >>>>>>>>>>>> From: Paolo Bonzini <pbonzini@redhat.com>
> > > > > > > > >>>>>>>>>>>> 
> > > > > > > > >>>>>>>>>>>> As an alternative to commit 818f86b (exec: limit system memory
> > > > > > > > >>>>>>>>>>>> size, 2013-11-04) let's just make all address spaces 64-bit wide.
> > > > > > > > >>>>>>>>>>>> This eliminates problems with phys_page_find ignoring bits above
> > > > > > > > >>>>>>>>>>>> TARGET_PHYS_ADDR_SPACE_BITS and address_space_translate_internal
> > > > > > > > >>>>>>>>>>>> consequently messing up the computations.
> > > > > > > > >>>>>>>>>>>> 
> > > > > > > > >>>>>>>>>>>> In Luiz's reported crash, at startup gdb attempts to read from address
> > > > > > > > >>>>>>>>>>>> 0xffffffffffffffe6 to 0xffffffffffffffff inclusive.  The region it gets
> > > > > > > > >>>>>>>>>>>> is the newly introduced master abort region, which is as big as the PCI
> > > > > > > > >>>>>>>>>>>> address space (see pci_bus_init).  Due to a typo that's only 2^63-1,
> > > > > > > > >>>>>>>>>>>> not 2^64.  But we get it anyway because phys_page_find ignores the upper
> > > > > > > > >>>>>>>>>>>> bits of the physical address.  In address_space_translate_internal then
> > > > > > > > >>>>>>>>>>>> 
> > > > > > > > >>>>>>>>>>>>   diff = int128_sub(section->mr->size, int128_make64(addr));
> > > > > > > > >>>>>>>>>>>>   *plen = int128_get64(int128_min(diff, int128_make64(*plen)));
> > > > > > > > >>>>>>>>>>>> 
> > > > > > > > >>>>>>>>>>>> diff becomes negative, and int128_get64 booms.
> > > > > > > > >>>>>>>>>>>> 
> > > > > > > > >>>>>>>>>>>> The size of the PCI address space region should be fixed anyway.
> > > > > > > > >>>>>>>>>>>> 
> > > > > > > > >>>>>>>>>>>> Reported-by: Luiz Capitulino <lcapitulino@redhat.com>
> > > > > > > > >>>>>>>>>>>> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> > > > > > > > >>>>>>>>>>>> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> > > > > > > > >>>>>>>>>>>> ---
> > > > > > > > >>>>>>>>>>>> exec.c | 8 ++------
> > > > > > > > >>>>>>>>>>>> 1 file changed, 2 insertions(+), 6 deletions(-)
> > > > > > > > >>>>>>>>>>>> 
> > > > > > > > >>>>>>>>>>>> diff --git a/exec.c b/exec.c
> > > > > > > > >>>>>>>>>>>> index 7e5ce93..f907f5f 100644
> > > > > > > > >>>>>>>>>>>> --- a/exec.c
> > > > > > > > >>>>>>>>>>>> +++ b/exec.c
> > > > > > > > >>>>>>>>>>>> @@ -94,7 +94,7 @@ struct PhysPageEntry {
> > > > > > > > >>>>>>>>>>>> #define PHYS_MAP_NODE_NIL (((uint32_t)~0) >> 6)
> > > > > > > > >>>>>>>>>>>> 
> > > > > > > > >>>>>>>>>>>> /* Size of the L2 (and L3, etc) page tables.  */
> > > > > > > > >>>>>>>>>>>> -#define ADDR_SPACE_BITS TARGET_PHYS_ADDR_SPACE_BITS
> > > > > > > > >>>>>>>>>>>> +#define ADDR_SPACE_BITS 64
> > > > > > > > >>>>>>>>>>>> 
> > > > > > > > >>>>>>>>>>>> #define P_L2_BITS 10
> > > > > > > > >>>>>>>>>>>> #define P_L2_SIZE (1 << P_L2_BITS)
> > > > > > > > >>>>>>>>>>>> @@ -1861,11 +1861,7 @@ static void memory_map_init(void)
> > > > > > > > >>>>>>>>>>>> {
> > > > > > > > >>>>>>>>>>>>    system_memory = g_malloc(sizeof(*system_memory));
> > > > > > > > >>>>>>>>>>>> 
> > > > > > > > >>>>>>>>>>>> -    assert(ADDR_SPACE_BITS <= 64);
> > > > > > > > >>>>>>>>>>>> -
> > > > > > > > >>>>>>>>>>>> -    memory_region_init(system_memory, NULL, "system",
> > > > > > > > >>>>>>>>>>>> -                       ADDR_SPACE_BITS == 64 ?
> > > > > > > > >>>>>>>>>>>> -                       UINT64_MAX : (0x1ULL << ADDR_SPACE_BITS));
> > > > > > > > >>>>>>>>>>>> +    memory_region_init(system_memory, NULL, "system", UINT64_MAX);
> > > > > > > > >>>>>>>>>>>>    address_space_init(&address_space_memory, system_memory, "memory");
> > > > > > > > >>>>>>>>>>>> 
> > > > > > > > >>>>>>>>>>>>    system_io = g_malloc(sizeof(*system_io));
> > > > > > > > >>>>>>>>>>> 
> > > > > > > > >>>>>>>>>>> This seems to have some unexpected consequences around sizing 64bit PCI
> > > > > > > > >>>>>>>>>>> BARs that I'm not sure how to handle.
> > > > > > > > >>>>>>>>>> 
> > > > > > > > >>>>>>>>>> BARs are often disabled during sizing. Maybe you
> > > > > > > > >>>>>>>>>> don't detect BAR being disabled?
> > > > > > > > >>>>>>>>> 
> > > > > > > > >>>>>>>>> See the trace below, the BARs are not disabled.  QEMU pci-core is doing
> > > > > > > > >>>>>>>>> the sizing and memory region updates for the BARs, vfio is just a
> > > > > > > > >>>>>>>>> pass-through here.
> > > > > > > > >>>>>>>> 
> > > > > > > > >>>>>>>> Sorry, not in the trace below, but yes the sizing seems to be happening
> > > > > > > > >>>>>>>> while I/O & memory are enabled in the command register.  Thanks,
> > > > > > > > >>>>>>>> 
> > > > > > > > >>>>>>>> Alex
> > > > > > > > >>>>>>> 
> > > > > > > > >>>>>>> OK then from QEMU POV this BAR value is not special at all.
> > > > > > > > >>>>>> 
> > > > > > > > >>>>>> Unfortunately
> > > > > > > > >>>>>> 
> > > > > > > > >>>>>>>>>>> After this patch I get vfio
> > > > > > > > >>>>>>>>>>> traces like this:
> > > > > > > > >>>>>>>>>>> 
> > > > > > > > >>>>>>>>>>> vfio: vfio_pci_read_config(0000:01:10.0, @0x10, len=0x4) febe0004
> > > > > > > > >>>>>>>>>>> (save lower 32bits of BAR)
> > > > > > > > >>>>>>>>>>> vfio: vfio_pci_write_config(0000:01:10.0, @0x10, 0xffffffff, len=0x4)
> > > > > > > > >>>>>>>>>>> (write mask to BAR)
> > > > > > > > >>>>>>>>>>> vfio: region_del febe0000 - febe3fff
> > > > > > > > >>>>>>>>>>> (memory region gets unmapped)
> > > > > > > > >>>>>>>>>>> vfio: vfio_pci_read_config(0000:01:10.0, @0x10, len=0x4) ffffc004
> > > > > > > > >>>>>>>>>>> (read size mask)
> > > > > > > > >>>>>>>>>>> vfio: vfio_pci_write_config(0000:01:10.0, @0x10, 0xfebe0004, len=0x4)
> > > > > > > > >>>>>>>>>>> (restore BAR)
> > > > > > > > >>>>>>>>>>> vfio: region_add febe0000 - febe3fff [0x7fcf3654d000]
> > > > > > > > >>>>>>>>>>> (memory region re-mapped)
> > > > > > > > >>>>>>>>>>> vfio: vfio_pci_read_config(0000:01:10.0, @0x14, len=0x4) 0
> > > > > > > > >>>>>>>>>>> (save upper 32bits of BAR)
> > > > > > > > >>>>>>>>>>> vfio: vfio_pci_write_config(0000:01:10.0, @0x14, 0xffffffff, len=0x4)
> > > > > > > > >>>>>>>>>>> (write mask to BAR)
> > > > > > > > >>>>>>>>>>> vfio: region_del febe0000 - febe3fff
> > > > > > > > >>>>>>>>>>> (memory region gets unmapped)
> > > > > > > > >>>>>>>>>>> vfio: region_add fffffffffebe0000 - fffffffffebe3fff [0x7fcf3654d000]
> > > > > > > > >>>>>>>>>>> (memory region gets re-mapped with new address)
> > > > > > > > >>>>>>>>>>> qemu-system-x86_64: vfio_dma_map(0x7fcf38861710, 0xfffffffffebe0000, 0x4000, 0x7fcf3654d000) = -14 (Bad address)
> > > > > > > > >>>>>>>>>>> (iommu barfs because it can only handle 48bit physical addresses)
> > > > > > > > >>>>>>>>>> 
> > > > > > > > >>>>>>>>>> Why are you trying to program BAR addresses for dma in the iommu?
> > > > > > > > >>>>>>>>> 
> > > > > > > > >>>>>>>>> Two reasons, first I can't tell the difference between RAM and MMIO.
> > > > > > > > >>>>>>> 
> > > > > > > > >>>>>>> Why can't you? Generally the memory core lets you find out easily.
> > > > > > > > >>>>>> 
> > > > > > > > >>>>>> My MemoryListener is setup for &address_space_memory and I then filter
> > > > > > > > >>>>>> out anything that's not memory_region_is_ram().  This still gets
> > > > > > > > >>>>>> through, so how do I easily find out?
> > > > > > > > >>>>>> 
> > > > > > > > >>>>>>> But in this case it's vfio device itself that is sized so for sure you
> > > > > > > > >>>>>>> know it's MMIO.
> > > > > > > > >>>>>> 
> > > > > > > > >>>>>> How so?  I have a MemoryListener as described above and pass everything
> > > > > > > > >>>>>> through to the IOMMU.  I suppose I could look through all the
> > > > > > > > >>>>>> VFIODevices and check if the MemoryRegion matches, but that seems really
> > > > > > > > >>>>>> ugly.
> > > > > > > > >>>>>> 
> > > > > > > > >>>>>>> Maybe you will have same issue if there's another device with a 64 bit
> > > > > > > > >>>>>>> bar though, like ivshmem?
> > > > > > > > >>>>>> 
> > > > > > > > >>>>>> Perhaps, I suspect I'll see anything that registers their BAR
> > > > > > > > >>>>>> MemoryRegion from memory_region_init_ram or memory_region_init_ram_ptr.
> > > > > > > > >>>>> 
> > > > > > > > >>>>> Must be a 64 bit BAR to trigger the issue though.
> > > > > > > > >>>>> 
> > > > > > > > >>>>>>>>> Second, it enables peer-to-peer DMA between devices, which is something
> > > > > > > > >>>>>>>>> that we might be able to take advantage of with GPU passthrough.
> > > > > > > > >>>>>>>>> 
> > > > > > > > >>>>>>>>>>> Prior to this change, there was no re-map with the fffffffffebe0000
> > > > > > > > >>>>>>>>>>> address, presumably because it was beyond the address space of the PCI
> > > > > > > > >>>>>>>>>>> window.  This address is clearly not in a PCI MMIO space, so why are we
> > > > > > > > >>>>>>>>>>> allowing it to be realized in the system address space at this location?
> > > > > > > > >>>>>>>>>>> Thanks,
> > > > > > > > >>>>>>>>>>> 
> > > > > > > > >>>>>>>>>>> Alex
> > > > > > > > >>>>>>>>>> 
> > > > > > > > >>>>>>>>>> Why do you think it is not in PCI MMIO space?
> > > > > > > > >>>>>>>>>> True, CPU can't access this address but other pci devices can.
> > > > > > > > >>>>>>>>> 
> > > > > > > > >>>>>>>>> What happens on real hardware when an address like this is programmed to
> > > > > > > > >>>>>>>>> a device?  The CPU doesn't have the physical bits to access it.  I have
> > > > > > > > >>>>>>>>> serious doubts that another PCI device would be able to access it
> > > > > > > > >>>>>>>>> either.  Maybe in some limited scenario where the devices are on the
> > > > > > > > >>>>>>>>> same conventional PCI bus.  In the typical case, PCI addresses are
> > > > > > > > >>>>>>>>> always limited by some kind of aperture, whether that's explicit in
> > > > > > > > >>>>>>>>> bridge windows or implicit in hardware design (and perhaps made explicit
> > > > > > > > >>>>>>>>> in ACPI).  Even if I wanted to filter these out as noise in vfio, how
> > > > > > > > >>>>>>>>> would I do it in a way that still allows real 64bit MMIO to be
> > > > > > > > >>>>>>>>> programmed.  PCI has this knowledge, I hope.  VFIO doesn't.  Thanks,
> > > > > > > > >>>>>>>>> 
> > > > > > > > >>>>>>>>> Alex
> > > > > > > > >>>>>>> 
> > > > > > > > >>>>>>> AFAIK PCI doesn't have that knowledge as such. PCI spec is explicit that
> > > > > > > > >>>>>>> full 64 bit addresses must be allowed and hardware validation
> > > > > > > > >>>>>>> test suites normally check that it actually does work
> > > > > > > > >>>>>>> if it happens.
> > > > > > > > >>>>>> 
> > > > > > > > >>>>>> Sure, PCI devices themselves, but the chipset typically has defined
> > > > > > > > >>>>>> routing, that's more what I'm referring to.  There are generally only
> > > > > > > > >>>>>> fixed address windows for RAM vs MMIO.
> > > > > > > > >>>>> 
> > > > > > > > >>>>> The physical chipset? Likely - in the presence of IOMMU.
> > > > > > > > >>>>> Without that, devices can talk to each other without going
> > > > > > > > >>>>> through chipset, and bridge spec is very explicit that
> > > > > > > > >>>>> full 64 bit addressing must be supported.
> > > > > > > > >>>>> 
> > > > > > > > >>>>> So as long as we don't emulate an IOMMU,
> > > > > > > > >>>>> guest will normally think it's okay to use any address.
> > > > > > > > >>>>> 
> > > > > > > > >>>>>>> Yes, if there's a bridge somewhere on the path that bridge's
> > > > > > > > >>>>>>> windows would protect you, but pci already does this filtering:
> > > > > > > > >>>>>>> if you see this address in the memory map this means
> > > > > > > > >>>>>>> your virtual device is on root bus.
> > > > > > > > >>>>>>> 
> > > > > > > > >>>>>>> So I think it's the other way around: if VFIO requires specific
> > > > > > > > >>>>>>> address ranges to be assigned to devices, it should give this
> > > > > > > > >>>>>>> info to qemu and qemu can give this to guest.
> > > > > > > > >>>>>>> Then anything outside that range can be ignored by VFIO.
> > > > > > > > >>>>>> 
> > > > > > > > >>>>>> Then we get into deficiencies in the IOMMU API and maybe VFIO.  There's
> > > > > > > > >>>>>> currently no way to find out the address width of the IOMMU.  We've been
> > > > > > > > >>>>>> getting by because it's safely close enough to the CPU address width to
> > > > > > > > >>>>>> not be a concern until we start exposing things at the top of the 64bit
> > > > > > > > >>>>>> address space.  Maybe I can safely ignore anything above
> > > > > > > > >>>>>> TARGET_PHYS_ADDR_SPACE_BITS for now.  Thanks,
> > > > > > > > >>>>>> 
> > > > > > > > >>>>>> Alex
> > > > > > > > >>>>> 
> > > > > > > > >>>>> I think it's not related to target CPU at all - it's a host limitation.
> > > > > > > > >>>>> So just make up your own constant, maybe depending on host architecture.
> > > > > > > > >>>>> Long term add an ioctl to query it.
> > > > > > > > >>>> 
> > > > > > > > >>>> It's a hardware limitation which I'd imagine has some loose ties to the
> > > > > > > > >>>> physical address bits of the CPU.
> > > > > > > > >>>> 
> > > > > > > > >>>>> Also, we can add a fwcfg interface to tell bios that it should avoid
> > > > > > > > >>>>> placing BARs above some address.
> > > > > > > > >>>> 
> > > > > > > > >>>> That doesn't help this case, it's a spurious mapping caused by sizing
> > > > > > > > >>>> the BARs with them enabled.  We may still want such a thing to feed into
> > > > > > > > >>>> building ACPI tables though.
> > > > > > > > >>> 
> > > > > > > > >>> Well the point is that if you want BIOS to avoid
> > > > > > > > >>> specific addresses, you need to tell it what to avoid.
> > > > > > > > >>> But neither BIOS nor ACPI actually cover the range above
> > > > > > > > >>> 2^48 ATM so it's not a high priority.
> > > > > > > > >>> 
> > > > > > > > >>>>> Since it's a vfio limitation I think it should be a vfio API, along the
> > > > > > > > >>>>> lines of vfio_get_addr_space_bits(void).
> > > > > > > > >>>>> (Is this true btw? legacy assignment doesn't have this problem?)
> > > > > > > > >>>> 
> > > > > > > > >>>> It's an IOMMU hardware limitation, legacy assignment has the same
> > > > > > > > >>>> problem.  It looks like legacy will abort() in QEMU for the failed
> > > > > > > > >>>> mapping and I'm planning to tighten vfio to also kill the VM for failed
> > > > > > > > >>>> mappings.  In the short term, I think I'll ignore any mappings above
> > > > > > > > >>>> TARGET_PHYS_ADDR_SPACE_BITS,
> > > > > > > > >>> 
> > > > > > > > >>> That seems very wrong. It will still fail on an x86 host if we are
> > > > > > > > >>> emulating a CPU with full 64 bit addressing. The limitation is on the
> > > > > > > > >>> host side; there's no real reason to tie it to the target.
> > > > > > > > > 
> > > > > > > > > I doubt vfio would be the only thing broken in that case.
> > > > > > > > > 
> > > > > > > > >>>> long term vfio already has an IOMMU info
> > > > > > > > >>>> ioctl that we could use to return this information, but we'll need to
> > > > > > > > >>>> figure out how to get it out of the IOMMU driver first.
> > > > > > > > >>>> Thanks,
> > > > > > > > >>>> 
> > > > > > > > >>>> Alex
> > > > > > > > >>> 
> > > > > > > > >>> Short term, just assume 48 bits on x86.
> > > > > > > > > 
> > > > > > > > > I hate to pick an arbitrary value since we have a very specific mapping
> > > > > > > > > we're trying to avoid.  Perhaps a better option is to skip anything
> > > > > > > > > where:
> > > > > > > > > 
> > > > > > > > >        MemoryRegionSection.offset_within_address_space >
> > > > > > > > >        ~MemoryRegionSection.offset_within_address_space
> > > > > > > > > 
> > > > > > > > >>> We need to figure out what's the limitation on ppc and arm -
> > > > > > > > >>> maybe there's none and it can address full 64 bit range.
> > > > > > > > >> 
> > > > > > > > >> IIUC on PPC and ARM you always have BAR windows where things can get mapped into, unlike x86 where the full physical address range can be overlaid by BARs.
> > > > > > > > >> 
> > > > > > > > >> Or did I misunderstand the question?
> > > > > > > > > 
> > > > > > > > > Sounds right, if either BAR mappings outside the window will not be
> > > > > > > > > realized in the memory space or the IOMMU has a full 64bit address
> > > > > > > > > space, there's no problem.  Here we have an intermediate step in the BAR
> > > > > > > > > sizing producing a stray mapping that the IOMMU hardware can't handle.
> > > > > > > > > Even if we could handle it, it's not clear that we want to.  On AMD-Vi
> > > > > > > > > the IOMMU page tables can grow to 6 levels deep.  A stray mapping like
> > > > > > > > > this then causes space and time overhead until the tables are pruned
> > > > > > > > > back down.  Thanks,
> > > > > > > > 
> > > > > > > > I thought sizing is hard-defined as setting the BAR to
> > > > > > > > -1? Can't we check for that one special case and treat it as "not mapped, but tell the guest the size in config space"?
> > > > > > > 
> > > > > > > PCI doesn't want to handle this as anything special to differentiate a
> > > > > > > sizing mask from a valid BAR address.  I agree though, I'd prefer to
> > > > > > > never see a spurious address like this in my MemoryListener.
> > > > > > 
> > > > > > It's more a can't than doesn't want to: it's a 64 bit BAR, it's not
> > > > > > set to all ones atomically.
> > > > > > 
> > > > > > Also, while it doesn't address this fully (same issue can happen
> > > > > > e.g. with ivshmem), do you think we should distinguish these BARs mapped
> > > > > > from vfio / device assignment in qemu somehow?
> > > > > > 
> > > > > > In particular, even when it has sane addresses:
> > > > > > device really can not DMA into its own BAR, that's a spec violation
> > > > > > so in theory can do anything including crashing the system.
> > > > > > I don't know what happens in practice but
> > > > > > if you are programming IOMMU to forward transactions back to
> > > > > > device that originated it, you are not doing it any favors.
> > > > > 
> > > > > I might concede that peer-to-peer is more trouble than it's worth if I
> > > > > had a convenient way to ignore MMIO mappings in my MemoryListener, but I
> > > > > don't.
> > > > 
> > > > Well for VFIO devices you are creating these mappings so we surely
> > > > can find a way for you to check that.
> > > > Doesn't each segment point back at the memory region that created it?
> > > > Then you can just check that.
> > > 
> > > It's a fairly heavy-weight search and it only avoids vfio devices, so it
> > > feels like it's just delaying a real solution.
> > 
> > Well there are several problems.
> > 
> > That a device gets its own BAR programmed
> > as a valid target in the IOMMU is in my opinion a separate bug,
> > and for *that* it's a real solution.
> 
> Except the side-effect of that solution is that it also disables
> peer-to-peer since we do not use separate IOMMU domains per device.  In
> fact, we can't guarantee that it's possible to use separate IOMMU
> domains per device.

Interesting. I guess we can make it work if there's a single
device; this will cover many users, though not all of them.

>  So, the cure is worse than the disease.

Worth checking what's worse. Want to try making the device DMA
into its own BAR and see what crashes? It's a spec violation
so all bets are off, but we can at least see what happens on some systems.

> > > > >  Self-DMA is really not the intent of doing the mapping, but
> > > > > peer-to-peer does have merit.
> > > > > 
> > > > > > I also note that if someone tries zero copy transmit out of such an
> > > > > > address, get user pages will fail.
> > > > > > I think this means tun zero copy transmit needs to fall back
> > > > > > on copy from user on get user pages failure.
> > > > > > 
> > > > > > Jason, what's your thinking on this?
> > > > > > 
> > > > > 
> > > > > 
> > > 
> > > 
> 
>
Mike D. Day Jan. 14, 2014, 5:49 p.m. UTC | #36
>> > > >>>>>>>
>> > > >>>>>>>>> Prior to this change, there was no re-map with the fffffffffebe0000

> If we choose not to map them, how do we distinguish them from guest RAM?
> There's no MemoryRegion flag that I'm aware of to distinguish a ram_ptr
> that points to a chunk of guest memory from one that points to the mmap
> of a device BAR.  I think I'd need to explicitly walk all of the vfio
> device and try to match the MemoryRegion pointer to one of my devices.
> That only solves the problem for vfio devices and not ivshmem devices or
> pci-assign devices.
>

I don't know if this will save you doing your memory region search or
not. But a BAR that ends with the low bit set is MMIO, and BAR that
ends with the low bit clear is RAM. So the address above is RAM as was
pointed out earlier in the thread. If you got an ambitious address in
the future you could test the low bit. But MMIO is deprecated
according to http://wiki.osdev.org/PCI so you probably won't see it,
at least for 64-bit addresses.

Mike
Mike D. Day Jan. 14, 2014, 5:55 p.m. UTC | #37
On Tue, Jan 14, 2014 at 12:49 PM, Mike Day <ncmike@ncultra.org> wrote:
>>> > > >>>>>>>
>>> > > >>>>>>>>> Prior to this change, there was no re-map with the fffffffffebe0000
>
>> If we choose not to map them, how do we distinguish them from guest RAM?
>> There's no MemoryRegion flag that I'm aware of to distinguish a ram_ptr
>> that points to a chunk of guest memory from one that points to the mmap
>> of a device BAR.  I think I'd need to explicitly walk all of the vfio
>> device and try to match the MemoryRegion pointer to one of my devices.
>> That only solves the problem for vfio devices and not ivshmem devices or
>> pci-assign devices.
>>
>
> I don't know if this will save you doing your memory region search or
> not. But a BAR that ends with the low bit set is MMIO, and BAR that
> ends with the low bit clear is RAM. So the address above is RAM as was
> pointed out earlier in the thread. If you got an ambitious address in
> the future you could test the low bit. But MMIO is deprecated
> according to http://wiki.osdev.org/PCI so you probably won't see it,
> at least for 64-bit addresses.

s/ambitious/ambiguous/

The address above has already been masked. What you need to do is read
the BAR. If the value from the BAR ends in '1', it's MMIO. If it ends in
'10', it's RAM. If it ends in '0n' it's disabled. The first thing that
the PCI software does after reading the BAR is mask off the two low
bits.

Mike
Alex Williamson Jan. 14, 2014, 6:05 p.m. UTC | #38
On Tue, 2014-01-14 at 12:55 -0500, Mike Day wrote:
> On Tue, Jan 14, 2014 at 12:49 PM, Mike Day <ncmike@ncultra.org> wrote:
> >>> > > >>>>>>>
> >>> > > >>>>>>>>> Prior to this change, there was no re-map with the fffffffffebe0000
> >
> >> If we choose not to map them, how do we distinguish them from guest RAM?
> >> There's no MemoryRegion flag that I'm aware of to distinguish a ram_ptr
> >> that points to a chunk of guest memory from one that points to the mmap
> >> of a device BAR.  I think I'd need to explicitly walk all of the vfio
> >> device and try to match the MemoryRegion pointer to one of my devices.
> >> That only solves the problem for vfio devices and not ivshmem devices or
> >> pci-assign devices.
> >>
> >
> > I don't know if this will save you doing your memory region search or
> > not. But a BAR that ends with the low bit set is MMIO, and BAR that
> > ends with the low bit clear is RAM. So the address above is RAM as was
> > pointed out earlier in the thread. If you got an ambitious address in
> > the future you could test the low bit. But MMIO is deprecated
> > according to http://wiki.osdev.org/PCI so you probably won't see it,
> > at least for 64-bit addresses.
> 
> s/ambitious/ambiguous/
> 
> The address above has already been masked. What you need to do is read
> the BAR. If the value from the BAR ends in '1', it's MMIO. If it ends in
> '10', it's RAM. If it ends in '0n' it's disabled. The first thing that
> the PCI software does after reading the BAR is mask off the two low
> bits.

Are you perhaps confusing MMIO and I/O port?  I/O port cannot be mmap'd
on x86, so it can't be directly mapped.  It also doesn't come through
the address_space_memory filter.  I/O port is deprecated, or at least
discouraged, MMIO is not.  Thanks,

Alex
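For reference, the low bits of a raw BAR value distinguish I/O port BARs from memory BARs (and 32-bit from 64-bit memory BARs); they say nothing about whether the backing is device MMIO or RAM. A small decode sketch of the standard encoding, with a hypothetical helper name:

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Decode the low bits of an unmasked PCI BAR value per the PCI spec:
 *   bit 0    : 1 = I/O port BAR, 0 = memory BAR
 *   bits 2:1 : memory BAR type, 00 = 32-bit, 10 = 64-bit
 *   bit 3    : prefetchable (memory BARs only)
 */
static void decode_bar(uint32_t bar)
{
    if (bar & 0x1) {
        printf("I/O port BAR, base 0x%" PRIx32 "\n", bar & ~(uint32_t)0x3);
    } else {
        int is_64bit = ((bar >> 1) & 0x3) == 0x2;
        int prefetch = (bar >> 3) & 0x1;
        printf("memory BAR, %s, %sprefetchable, low base 0x%" PRIx32 "\n",
               is_64bit ? "64-bit" : "32-bit",
               prefetch ? "" : "non-",
               bar & ~(uint32_t)0xf);
    }
}

int main(void)
{
    decode_bar(0xfebe0004);    /* value from the trace: 64-bit memory BAR */
    return 0;
}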
Mike D. Day Jan. 14, 2014, 6:20 p.m. UTC | #39
>>
>> The address above has already been masked. What you need to do is read
>> the BAR. If the value from the BAR ends in '1', it's MMIO. If it ends in
>> '10', it's RAM. If it ends in '0n' it's disabled. The first thing that
>> the PCI software does after reading the BAR is mask off the two low
>> bits.
>
> Are you perhaps confusing MMIO and I/O port?  I/O port cannot be mmap'd
> on x86, so it can't be directly mapped.  It also doesn't come through
> the address_space_memory filter.  I/O port is deprecated, or at least
> discouraged, MMIO is not.  Thanks,

You're right, sorry I missed that. It doesn't solve the problem.

Mike
Alexey Kardashevskiy Jan. 15, 2014, 12:48 a.m. UTC | #40
On 01/15/2014 01:05 AM, Michael S. Tsirkin wrote:
> On Tue, Jan 14, 2014 at 08:50:54AM -0500, Mike Day wrote:
>>
>> "Michael S. Tsirkin" <mst@redhat.com> writes:
>>
>>> On Fri, Jan 10, 2014 at 08:31:36AM -0700, Alex Williamson wrote:
>>
>>> Short term, just assume 48 bits on x86.
>>>
>>> We need to figure out what's the limitation on ppc and arm -
>>> maybe there's none and it can address full 64 bit range.
>>>
>>> Cc some people who might know about these platforms.
>>
>> The document you need is here: 
>>
>> http://goo.gl/fJYxdN
>>
>> "PCI Bus Binding To: IEEE Std 1275-1994"
>>
>> The short answer is that Power (OpenFirmware-to-PCI) supports both MMIO
>> and Memory mappings for BARs.
>>
>> Also, both 32-bit and 64-bit BARs are required to be supported. It is
>> legal to construct a 64-bit BAR by masking all the high bits to
>> zero. Presumably it would be OK to mask the 16 high bits to zero as
>> well, constructing a 48-bit address.
>>
>> Mike
>>
>> -- 
>> Mike Day | "Endurance is a Virtue"
> 
> The question was whether addresses such as 
> 0xfffffffffec00000 can be a valid BAR value on these
> platforms, whether it's accessible to the CPU and
> to other PCI devices.


On ppc64, the guest address is limited to 60 bits (@Alex: even the PA from the
HPT has the same limit), but there is no actual limit on PCI bus addresses. The
actual hardware has limits (less than 60 bits but close), but since we
do not emulate any real PHB in qemu-spapr and do para-virtualization, we do
not have to impose limits there, and BARs like 0xfffffffffec00000 should be
allowed (though we do not really expect them to be that big).
Mike D. Day Jan. 20, 2014, 4:20 p.m. UTC | #41
Do you know which device is writing to the BAR below? From the trace
it appears it should be restoring the memory address to the BAR after
writing all 1s to the BAR and reading back the contents. (the protocol
for finding the length of the bar memory.)

On Thu, Jan 9, 2014 at 12:24 PM, Alex Williamson
<alex.williamson@redhat.com> wrote:
> On Wed, 2013-12-11 at 20:30 +0200, Michael S. Tsirkin wrote:
>> From: Paolo Bonzini <pbonzini@redhat.com>
> vfio: vfio_pci_read_config(0000:01:10.0, @0x10, len=0x4) febe0004
> (save lower 32bits of BAR)
> vfio: vfio_pci_write_config(0000:01:10.0, @0x10, 0xffffffff, len=0x4)
> (write mask to BAR)

Here the device should restore the memory address (original contents)
to the BAR.

> vfio: region_del febe0000 - febe3fff
> (memory region gets unmapped)
> vfio: vfio_pci_read_config(0000:01:10.0, @0x10, len=0x4) ffffc004
> (read size mask)
> vfio: vfio_pci_write_config(0000:01:10.0, @0x10, 0xfebe0004, len=0x4)
> (restore BAR)
> vfio: region_add febe0000 - febe3fff [0x7fcf3654d000]
> (memory region re-mapped)
> vfio: vfio_pci_read_config(0000:01:10.0, @0x14, len=0x4) 0
> (save upper 32bits of BAR)
> vfio: vfio_pci_write_config(0000:01:10.0, @0x14, 0xffffffff, len=0x4)
> (write mask to BAR)

and here ...

> vfio: region_del febe0000 - febe3fff
> (memory region gets unmapped)
> vfio: region_add fffffffffebe0000 - fffffffffebe3fff [0x7fcf3654d000]
> (memory region gets re-mapped with new address)
> qemu-system-x86_64: vfio_dma_map(0x7fcf38861710, 0xfffffffffebe0000, 0x4000, 0x7fcf3654d000) = -14 (Bad address)
> (iommu barfs because it can only handle 48bit physical addresses)

I looked around some but I couldn't find an obvious culprit. Could it
be that the BAR is getting unmapped automatically due to
x-intx-mmap-timeout-ms before the device has a chance to finish
restoring the correct value to the BAR?

Mike
Alex Williamson Jan. 20, 2014, 4:45 p.m. UTC | #42
On Mon, 2014-01-20 at 11:20 -0500, Mike Day wrote:
> Do you know which device is writing to the BAR below? From the trace
> it appears it should be restoring the memory address to the BAR after
> writing all 1s to the BAR and reading back the contents. (the protocol
> for finding the length of the bar memory.)

The guest itself is writing to the BARs.  This is a standard sizing
operation by the guest.

> On Thu, Jan 9, 2014 at 12:24 PM, Alex Williamson
> <alex.williamson@redhat.com> wrote:
> > On Wed, 2013-12-11 at 20:30 +0200, Michael S. Tsirkin wrote:
> >> From: Paolo Bonzini <pbonzini@redhat.com>
> > vfio: vfio_pci_read_config(0000:01:10.0, @0x10, len=0x4) febe0004
> > (save lower 32bits of BAR)
> > vfio: vfio_pci_write_config(0000:01:10.0, @0x10, 0xffffffff, len=0x4)
> > (write mask to BAR)
> 
> Here the device should restore the memory address (original contents)
> to the BAR.

Sorry if it's not clear, the trace here is what the vfio-pci driver
sees.  We're just observing the sizing operation of the guest, therefore
we see:

1) orig = read()
2) write(0xffffffff)
3) size_mask = read()
4) write(orig)

We're only at step 2)

> > vfio: region_del febe0000 - febe3fff
> > (memory region gets unmapped)
> > vfio: vfio_pci_read_config(0000:01:10.0, @0x10, len=0x4) ffffc004
> > (read size mask)

step 3)

> > vfio: vfio_pci_write_config(0000:01:10.0, @0x10, 0xfebe0004, len=0x4)
> > (restore BAR)

step 4)

> > vfio: region_add febe0000 - febe3fff [0x7fcf3654d000]
> > (memory region re-mapped)
> > vfio: vfio_pci_read_config(0000:01:10.0, @0x14, len=0x4) 0
> > (save upper 32bits of BAR)
> > vfio: vfio_pci_write_config(0000:01:10.0, @0x14, 0xffffffff, len=0x4)
> > (write mask to BAR)
> 
> and here ...

This is the same as above to the next BAR, which is the upper 32bits of
the 64bit BAR.

> > vfio: region_del febe0000 - febe3fff
> > (memory region gets unmapped)
> > vfio: region_add fffffffffebe0000 - fffffffffebe3fff [0x7fcf3654d000]
> > (memory region gets re-mapped with new address)
> > qemu-system-x86_64: vfio_dma_map(0x7fcf38861710, 0xfffffffffebe0000, 0x4000, 0x7fcf3654d000) = -14 (Bad address)
> > (iommu barfs because it can only handle 48bit physical addresses)
> 
> I looked around some but I couldn't find an obvious culprit. Could it
> be that the BAR is getting unmapped automatically due to
> x-intx-mmap-timeout-ms before the device has a chance to finish
> restoring the correct value to the BAR?

No, this is simply the guest sizing the BAR, this is not an internally
generated operation.  The INTx emulation isn't used here as KVM
acceleration is enabled.  That also only toggles the enable setting on
the mmap'd MemoryRegion, it doesn't change the address it's mapped to.
Thanks,

Alex
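A sketch of that four-step sequence as a guest performs it on a 64-bit memory BAR, each dword sized separately with decode still enabled, which is exactly where the transient 0xfffffffffebe0000 mapping comes from; pci_cfg_read32()/pci_cfg_write32() are hypothetical config-space accessors, not a real API:

#include <stdint.h>

/* Hypothetical config-space accessors standing in for whatever mechanism
 * the guest uses (port 0xcf8/0xcfc, MMCONFIG, ...). */
extern uint32_t pci_cfg_read32(unsigned off);
extern void pci_cfg_write32(unsigned off, uint32_t val);

static uint32_t size_bar_dword(unsigned off)
{
    uint32_t orig = pci_cfg_read32(off);   /* 1) save original value    */
    pci_cfg_write32(off, 0xffffffff);      /* 2) write all ones         */
    uint32_t mask = pci_cfg_read32(off);   /* 3) read back size mask    */
    pci_cfg_write32(off, orig);            /* 4) restore original value */
    return mask;
}

uint64_t size_64bit_mem_bar(unsigned bar_off /* e.g. 0x10 */)
{
    uint32_t lo = size_bar_dword(bar_off);      /* trace: febe0004 -> ffffc004 */
    uint32_t hi = size_bar_dword(bar_off + 4);  /* between 2) and 4) here the
                                                   BAR transiently decodes at
                                                   0xfffffffffebe0000 */
    uint64_t mask = ((uint64_t)hi << 32) | (lo & ~(uint64_t)0xf);
    return ~mask + 1;                           /* 0x4000 for the trace above */
}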
Michael S. Tsirkin Jan. 20, 2014, 5:04 p.m. UTC | #43
On Mon, Jan 20, 2014 at 09:45:25AM -0700, Alex Williamson wrote:
> On Mon, 2014-01-20 at 11:20 -0500, Mike Day wrote:
> > Do you know which device is writing to the BAR below? From the trace
> > it appears it should be restoring the memory address to the BAR after
> > writing all 1s to the BAR and reading back the contents. (the protocol
> > for finding the length of the bar memory.)
> 
> The guest itself is writing to the BARs.  This is a standard sizing
> operation by the guest.

Question is maybe device memory should be disabled?
Does windows do this too (sizing when memory enabled)?


> > On Thu, Jan 9, 2014 at 12:24 PM, Alex Williamson
> > <alex.williamson@redhat.com> wrote:
> > > On Wed, 2013-12-11 at 20:30 +0200, Michael S. Tsirkin wrote:
> > >> From: Paolo Bonzini <pbonzini@redhat.com>
> > > vfio: vfio_pci_read_config(0000:01:10.0, @0x10, len=0x4) febe0004
> > > (save lower 32bits of BAR)
> > > vfio: vfio_pci_write_config(0000:01:10.0, @0x10, 0xffffffff, len=0x4)
> > > (write mask to BAR)
> > 
> > Here the device should restore the memory address (original contents)
> > to the BAR.
> 
> Sorry if it's not clear, the trace here is what the vfio-pci driver
> sees.  We're just observing the sizing operation of the guest, therefore
> we see:
> 
> 1) orig = read()
> 2) write(0xffffffff)
> 3) size_mask = read()
> 4) write(orig)
> 
> We're only at step 2)
> 
> > > vfio: region_del febe0000 - febe3fff
> > > (memory region gets unmapped)
> > > vfio: vfio_pci_read_config(0000:01:10.0, @0x10, len=0x4) ffffc004
> > > (read size mask)
> 
> step 3)
> 
> > > vfio: vfio_pci_write_config(0000:01:10.0, @0x10, 0xfebe0004, len=0x4)
> > > (restore BAR)
> 
> step 4)
> 
> > > vfio: region_add febe0000 - febe3fff [0x7fcf3654d000]
> > > (memory region re-mapped)
> > > vfio: vfio_pci_read_config(0000:01:10.0, @0x14, len=0x4) 0
> > > (save upper 32bits of BAR)
> > > vfio: vfio_pci_write_config(0000:01:10.0, @0x14, 0xffffffff, len=0x4)
> > > (write mask to BAR)
> > 
> > and here ...
> 
> This is the same as above to the next BAR, which is the upper 32bits of
> the 64bit BAR.
> 
> > > vfio: region_del febe0000 - febe3fff
> > > (memory region gets unmapped)
> > > vfio: region_add fffffffffebe0000 - fffffffffebe3fff [0x7fcf3654d000]
> > > (memory region gets re-mapped with new address)
> > > qemu-system-x86_64: vfio_dma_map(0x7fcf38861710, 0xfffffffffebe0000, 0x4000, 0x7fcf3654d000) = -14 (Bad address)
> > > (iommu barfs because it can only handle 48bit physical addresses)
> > 
> > I looked around some but I couldn't find an obvious culprit. Could it
> > be that the BAR is getting unmapped automatically due to
> > x-intx-mmap-timeout-ms before the device has a chance to finish
> > restoring the correct value to the BAR?
> 
> No, this is simply the guest sizing the BAR, this is not an internally
> generated operation.  The INTx emulation isn't used here as KVM
> acceleration is enabled.  That also only toggles the enable setting on
> the mmap'd MemoryRegion, it doesn't change the address it's mapped to.
> Thanks,
> 
> Alex
Alex Williamson Jan. 20, 2014, 5:16 p.m. UTC | #44
On Mon, 2014-01-20 at 19:04 +0200, Michael S. Tsirkin wrote:
> On Mon, Jan 20, 2014 at 09:45:25AM -0700, Alex Williamson wrote:
> > On Mon, 2014-01-20 at 11:20 -0500, Mike Day wrote:
> > > Do you know which device is writing to the BAR below? From the trace
> > > it appears it should be restoring the memory address to the BAR after
> > > writing all 1s to the BAR and reading back the contents. (the protocol
> > > for finding the length of the bar memory.)
> > 
> > The guest itself is writing to the BARs.  This is a standard sizing
> > operation by the guest.
> 
> Question is maybe device memory should be disabled?
> Does windows do this too (sizing when memory enabled)?

Per the spec I would have expected memory & I/O to be disabled on the
device during a sizing operation, but that's not the case here.  I
thought you were the one that said Linux doesn't do this because some
devices don't properly re-enable.  I'm not sure how it would change our
approach to this to know whether Windows behaves the same since sizing
while disabled is not an issue and we apparently need to support sizing
while enabled regardless.  Thanks,

Alex

> > > On Thu, Jan 9, 2014 at 12:24 PM, Alex Williamson
> > > <alex.williamson@redhat.com> wrote:
> > > > On Wed, 2013-12-11 at 20:30 +0200, Michael S. Tsirkin wrote:
> > > >> From: Paolo Bonzini <pbonzini@redhat.com>
> > > > vfio: vfio_pci_read_config(0000:01:10.0, @0x10, len=0x4) febe0004
> > > > (save lower 32bits of BAR)
> > > > vfio: vfio_pci_write_config(0000:01:10.0, @0x10, 0xffffffff, len=0x4)
> > > > (write mask to BAR)
> > > 
> > > Here the device should restore the memory address (original contents)
> > > to the BAR.
> > 
> > Sorry if it's not clear, the trace here is what the vfio-pci driver
> > sees.  We're just observing the sizing operation of the guest, therefore
> > we see:
> > 
> > 1) orig = read()
> > 2) write(0xffffffff)
> > 3) size_mask = read()
> > 4) write(orig)
> > 
> > We're only at step 2)
> > 
> > > > vfio: region_del febe0000 - febe3fff
> > > > (memory region gets unmapped)
> > > > vfio: vfio_pci_read_config(0000:01:10.0, @0x10, len=0x4) ffffc004
> > > > (read size mask)
> > 
> > step 3)
> > 
> > > > vfio: vfio_pci_write_config(0000:01:10.0, @0x10, 0xfebe0004, len=0x4)
> > > > (restore BAR)
> > 
> > step 4)
> > 
> > > > vfio: region_add febe0000 - febe3fff [0x7fcf3654d000]
> > > > (memory region re-mapped)
> > > > vfio: vfio_pci_read_config(0000:01:10.0, @0x14, len=0x4) 0
> > > > (save upper 32bits of BAR)
> > > > vfio: vfio_pci_write_config(0000:01:10.0, @0x14, 0xffffffff, len=0x4)
> > > > (write mask to BAR)
> > > 
> > > and here ...
> > 
> > This is the same as above to the next BAR, which is the upper 32bits of
> > the 64bit BAR.
> > 
> > > > vfio: region_del febe0000 - febe3fff
> > > > (memory region gets unmapped)
> > > > vfio: region_add fffffffffebe0000 - fffffffffebe3fff [0x7fcf3654d000]
> > > > (memory region gets re-mapped with new address)
> > > > qemu-system-x86_64: vfio_dma_map(0x7fcf38861710, 0xfffffffffebe0000, 0x4000, 0x7fcf3654d000) = -14 (Bad address)
> > > > (iommu barfs because it can only handle 48bit physical addresses)
> > > 
> > > I looked around some but I couldn't find an obvious culprit. Could it
> > > be that the BAR is getting unmapped automatically due to
> > > x-intx-mmap-timeout-ms before the device has a chance to finish
> > > restoring the correct value to the BAR?
> > 
> > No, this is simply the guest sizing the BAR, this is not an internally
> > generated operation.  The INTx emulation isn't used here as KVM
> > acceleration is enabled.  That also only toggles the enable setting on
> > the mmap'd MemoryRegion, it doesn't change the address it's mapped to.
> > Thanks,
> > 
> > Alex
Michael S. Tsirkin Jan. 20, 2014, 8:37 p.m. UTC | #45
On Mon, Jan 20, 2014 at 10:16:01AM -0700, Alex Williamson wrote:
> On Mon, 2014-01-20 at 19:04 +0200, Michael S. Tsirkin wrote:
> > On Mon, Jan 20, 2014 at 09:45:25AM -0700, Alex Williamson wrote:
> > > On Mon, 2014-01-20 at 11:20 -0500, Mike Day wrote:
> > > > Do you know which device is writing to the BAR below? From the trace
> > > > it appears it should be restoring the memory address to the BAR after
> > > > writing all 1s to the BAR and reading back the contents. (the protocol
> > > > for finding the length of the bar memory.)
> > > 
> > > The guest itself is writing to the BARs.  This is a standard sizing
> > > operation by the guest.
> > 
> > Question is maybe device memory should be disabled?
> > Does windows do this too (sizing when memory enabled)?
> 
> Per the spec I would have expected memory & I/O to be disabled on the
> device during a sizing operation, but that's not the case here.  I
> thought you were the one that said Linux doesn't do this because some
> devices don't properly re-enable.

Yes. But maybe we can white-list devices or something.
I'm guessing modern express devices are all sane
and let you disable/enable memory any number
of times.

> I'm not sure how it would change our
> approach to this to know whether Windows behaves the same since sizing
> while disabled is not an issue and we apparently need to support sizing
> while enabled regardless.  Thanks,
> 
> Alex

I'm talking about changing Linux here.
If Windows is already doing this, that gives us more
hope that this will actually work.
Yes, we need the work-around in qemu regardless.
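A rough sketch of what that Linux-side change might look like: clear memory and I/O decode in the command register around the sizing accesses, then restore the original command value. The config accessors and PCI_COMMAND bits are the normal kernel ones; the function names and the omitted white-list policy are assumptions:

#include <linux/pci.h>

/* Size one BAR dword the usual way: save, write all ones, read back the
 * size mask, restore. */
static u32 sketch_size_bar_dword(struct pci_dev *dev, int off)
{
        u32 orig, mask;

        pci_read_config_dword(dev, off, &orig);
        pci_write_config_dword(dev, off, ~0);
        pci_read_config_dword(dev, off, &mask);
        pci_write_config_dword(dev, off, orig);
        return mask;
}

/* Size a 64-bit memory BAR with decode disabled, so no half-programmed
 * value is ever visible on the bus.  Whether all devices tolerate having
 * memory decode toggled is the open question above. */
static u64 sketch_size_64bit_bar(struct pci_dev *dev, int bar_off)
{
        u16 cmd;
        u32 lo, hi;

        pci_read_config_word(dev, PCI_COMMAND, &cmd);
        pci_write_config_word(dev, PCI_COMMAND,
                              cmd & ~(PCI_COMMAND_MEMORY | PCI_COMMAND_IO));

        lo = sketch_size_bar_dword(dev, bar_off);
        hi = sketch_size_bar_dword(dev, bar_off + 4);

        pci_write_config_word(dev, PCI_COMMAND, cmd);   /* restore decode */

        return ~(((u64)hi << 32) | (lo & PCI_BASE_ADDRESS_MEM_MASK)) + 1;
}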


> > > > On Thu, Jan 9, 2014 at 12:24 PM, Alex Williamson
> > > > <alex.williamson@redhat.com> wrote:
> > > > > On Wed, 2013-12-11 at 20:30 +0200, Michael S. Tsirkin wrote:
> > > > >> From: Paolo Bonzini <pbonzini@redhat.com>
> > > > > vfio: vfio_pci_read_config(0000:01:10.0, @0x10, len=0x4) febe0004
> > > > > (save lower 32bits of BAR)
> > > > > vfio: vfio_pci_write_config(0000:01:10.0, @0x10, 0xffffffff, len=0x4)
> > > > > (write mask to BAR)
> > > > 
> > > > Here the device should restore the memory address (original contents)
> > > > to the BAR.
> > > 
> > > Sorry if it's not clear, the trace here is what the vfio-pci driver
> > > sees.  We're just observing the sizing operation of the guest, therefore
> > > we see:
> > > 
> > > 1) orig = read()
> > > 2) write(0xffffffff)
> > > 3) size_mask = read()
> > > 4) write(orig)
> > > 
> > > We're only at step 2)
> > > 
> > > > > vfio: region_del febe0000 - febe3fff
> > > > > (memory region gets unmapped)
> > > > > vfio: vfio_pci_read_config(0000:01:10.0, @0x10, len=0x4) ffffc004
> > > > > (read size mask)
> > > 
> > > step 3)
> > > 
> > > > > vfio: vfio_pci_write_config(0000:01:10.0, @0x10, 0xfebe0004, len=0x4)
> > > > > (restore BAR)
> > > 
> > > step 4)
> > > 
> > > > > vfio: region_add febe0000 - febe3fff [0x7fcf3654d000]
> > > > > (memory region re-mapped)
> > > > > vfio: vfio_pci_read_config(0000:01:10.0, @0x14, len=0x4) 0
> > > > > (save upper 32bits of BAR)
> > > > > vfio: vfio_pci_write_config(0000:01:10.0, @0x14, 0xffffffff, len=0x4)
> > > > > (write mask to BAR)
> > > > 
> > > > and here ...
> > > 
> > > This is the same as above to the next BAR, which is the upper 32bits of
> > > the 64bit BAR.
> > > 
> > > > > vfio: region_del febe0000 - febe3fff
> > > > > (memory region gets unmapped)
> > > > > vfio: region_add fffffffffebe0000 - fffffffffebe3fff [0x7fcf3654d000]
> > > > > (memory region gets re-mapped with new address)
> > > > > qemu-system-x86_64: vfio_dma_map(0x7fcf38861710, 0xfffffffffebe0000, 0x4000, 0x7fcf3654d000) = -14 (Bad address)
> > > > > (iommu barfs because it can only handle 48bit physical addresses)
> > > > 
> > > > I looked around some but I couldn't find an obvious culprit. Could it
> > > > be that the BAR is getting unmapped automatically due to
> > > > x-intx-mmap-timeout-ms before the device has a chance to finish
> > > > restoring the correct value to the BAR?
> > > 
> > > No, this is simply the guest sizing the BAR, this is not an internally
> > > generated operation.  The INTx emulation isn't used here as KVM
> > > acceleration is enabled.  That also only toggles the enable setting on
> > > the mmap'd MemoryRegion, it doesn't change the address it's mapped to.
> > > Thanks,
> > > 
> > > Alex
> 
>
diff mbox

Patch

diff becomes negative, and int128_get64 booms.

The size of the PCI address space region should be fixed anyway.

Reported-by: Luiz Capitulino <lcapitulino@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---
 exec.c | 8 ++------
 1 file changed, 2 insertions(+), 6 deletions(-)

diff --git a/exec.c b/exec.c
index 7e5ce93..f907f5f 100644
--- a/exec.c
+++ b/exec.c
@@ -94,7 +94,7 @@  struct PhysPageEntry {
 #define PHYS_MAP_NODE_NIL (((uint32_t)~0) >> 6)
 
 /* Size of the L2 (and L3, etc) page tables.  */
-#define ADDR_SPACE_BITS TARGET_PHYS_ADDR_SPACE_BITS
+#define ADDR_SPACE_BITS 64
 
 #define P_L2_BITS 10
 #define P_L2_SIZE (1 << P_L2_BITS)
@@ -1861,11 +1861,7 @@  static void memory_map_init(void)
 {
     system_memory = g_malloc(sizeof(*system_memory));
 
-    assert(ADDR_SPACE_BITS <= 64);
-
-    memory_region_init(system_memory, NULL, "system",
-                       ADDR_SPACE_BITS == 64 ?
-                       UINT64_MAX : (0x1ULL << ADDR_SPACE_BITS));
+    memory_region_init(system_memory, NULL, "system", UINT64_MAX);
     address_space_init(&address_space_memory, system_memory, "memory");
 
     system_io = g_malloc(sizeof(*system_io));