
[qemu,RFC,5/7] spapr-iommu: Always advertise the maximum possible DMA window size

Message ID 20181113083104.2692-6-aik@ozlabs.ru
State New
Series spapr_pci, vfio: NVIDIA V100 + P9 passthrough

Commit Message

Alexey Kardashevskiy Nov. 13, 2018, 8:31 a.m. UTC
When deciding on the size of the huge DMA window, a typical Linux pseries
guest uses the maximum allowed RAM size as the upper limit. We did the same
on the QEMU side to match that logic. Now we are going to support GPU RAM
passthrough, which is not available at guest boot time as it requires
guest driver interaction. As a result, the guest requests a smaller
window than it should, so the guest needs to be patched to understand
this new memory, and so does QEMU.

Instead of reimplementing here whatever solution we eventually choose for
the guest, this advertises the biggest possible window size, limited to
32 bits (as defined by LoPAPR).

This seems to be safe as:
1. the guest-visible emulated table is allocated in KVM (actual pages
are allocated in the page fault handler) and in QEMU (actual pages are
allocated when changed);
2. the hardware table (and the corresponding userspace address cache)
supports sparse allocation and also checks the locked_vm limit, so
it cannot cause the host any damage.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
 hw/ppc/spapr_rtas_ddw.c | 19 +++----------------
 1 file changed, 3 insertions(+), 16 deletions(-)
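For context, a minimal sketch (not QEMU code; names are illustrative) of the old and new behaviour in terms of the 32-bit TCE count that ibm,query-pe-dma-window returns:

```c
#include <assert.h>
#include <stdint.h>

#define TCE_PAGE_SHIFT 12   /* 4K TCE pages, as in SPAPR_TCE_PAGE_SHIFT */

/* Old behaviour: cap the window at the maximum guest RAM, in TCE pages. */
static uint64_t old_max_tces(uint64_t max_ram_bytes)
{
    return max_ram_bytes >> TCE_PAGE_SHIFT;
}

/* New behaviour: always the largest value the 32-bit RTAS return
 * parameter can carry, i.e. just under 16 TiB worth of 4K TCEs. */
static uint32_t new_max_tces(void)
{
    return 0xFFFFFFFFu;
}
```

A 16 GiB guest previously saw 4M TCEs (a window bounded by its RAM size); with the patch every guest sees the 32-bit maximum regardless of RAM, so hotplugged memory such as GPU RAM still fits.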

Comments

David Gibson Nov. 19, 2018, 2:42 a.m. UTC | #1
On Tue, Nov 13, 2018 at 07:31:02PM +1100, Alexey Kardashevskiy wrote:
> When deciding about the huge DMA window, the typical Linux pseries guest
> uses the maximum allowed RAM size as the upper limit. We did the same
> on QEMU side to match that logic. Now we are going to support GPU RAM
> pass through which is not available at the guest boot time as it requires
> the guest driver interaction. As the result, the guest requests a smaller
> window than it should. Therefore the guest needs to be patched to
> understand this new memory and so does QEMU.
> 
> Instead of reimplementing here whatever solution we will choose for
> the guest, this advertises the biggest possible window size limited by
> 32 bit (as defined by LoPAPR).
> 
> This seems to be safe as:
> 1. The guest visible emulated table is allocated in KVM (actual pages
> are allocated in page fault handler) and QEMU (actual pages are allocated
> when changed);
> 2. The hardware table (and corresponding userspace addresses cache)
> supports sparse allocation and also checks for locked_vm limit so
> it is unable to cause the host any damage.
> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>

This seems like a good idea to me, even without the NPU stuff.  It
always bothered me slightly that we based what's effectively the IOVA
limit on the guest RAM size, which doesn't have any direct connection
to it.

As long as it doesn't hit the locked memory limits, I don't see any
reason we should prevent a guest from doing something weird like
mapping a small bit of RAM over and over in IOVA space, or mapping its
RAM sparsely in IOVA space.

> ---
>  hw/ppc/spapr_rtas_ddw.c | 19 +++----------------
>  1 file changed, 3 insertions(+), 16 deletions(-)
> 
> diff --git a/hw/ppc/spapr_rtas_ddw.c b/hw/ppc/spapr_rtas_ddw.c
> index 329feb1..df60351 100644
> --- a/hw/ppc/spapr_rtas_ddw.c
> +++ b/hw/ppc/spapr_rtas_ddw.c
> @@ -96,9 +96,8 @@ static void rtas_ibm_query_pe_dma_window(PowerPCCPU *cpu,
>                                           uint32_t nret, target_ulong rets)
>  {
>      sPAPRPHBState *sphb;
> -    uint64_t buid, max_window_size;
> +    uint64_t buid;
>      uint32_t avail, addr, pgmask = 0;
> -    MachineState *machine = MACHINE(spapr);
>  
>      if ((nargs != 3) || (nret != 5)) {
>          goto param_error_exit;
> @@ -114,27 +113,15 @@ static void rtas_ibm_query_pe_dma_window(PowerPCCPU *cpu,
>      /* Translate page mask to LoPAPR format */
>      pgmask = spapr_page_mask_to_query_mask(sphb->page_size_mask);
>  
> -    /*
> -     * This is "Largest contiguous block of TCEs allocated specifically
> -     * for (that is, are reserved for) this PE".
> -     * Return the maximum number as maximum supported RAM size was in 4K pages.
> -     */
> -    if (machine->ram_size == machine->maxram_size) {
> -        max_window_size = machine->ram_size;
> -    } else {
> -        max_window_size = machine->device_memory->base +
> -                          memory_region_size(&machine->device_memory->mr);
> -    }
> -
>      avail = SPAPR_PCI_DMA_MAX_WINDOWS - spapr_phb_get_active_win_num(sphb);
>  
>      rtas_st(rets, 0, RTAS_OUT_SUCCESS);
>      rtas_st(rets, 1, avail);
> -    rtas_st(rets, 2, max_window_size >> SPAPR_TCE_PAGE_SHIFT);
> +    rtas_st(rets, 2, 0xFFFFFFFF); /* Largest contiguous block of TCEs */

One detail, though: where does this limit actually come from?  Is it
actually a limit imposed by the hardware somewhere, or just because
the RTAS call doesn't have room for anything more?

Previously limits would usually have been a power of 2; is it likely
to matter that now it won't be?

>      rtas_st(rets, 3, pgmask);
>      rtas_st(rets, 4, 0); /* DMA migration mask, not supported */
>  
> -    trace_spapr_iommu_ddw_query(buid, addr, avail, max_window_size, pgmask);
> +    trace_spapr_iommu_ddw_query(buid, addr, avail, 0xFFFFFFFF, pgmask);
>      return;
>  
>  param_error_exit:
Alexey Kardashevskiy Nov. 19, 2018, 5:08 a.m. UTC | #2
On 19/11/2018 13:42, David Gibson wrote:
> On Tue, Nov 13, 2018 at 07:31:02PM +1100, Alexey Kardashevskiy wrote:
>> When deciding about the huge DMA window, the typical Linux pseries guest
>> uses the maximum allowed RAM size as the upper limit. We did the same
>> on QEMU side to match that logic. Now we are going to support GPU RAM
>> pass through which is not available at the guest boot time as it requires
>> the guest driver interaction. As the result, the guest requests a smaller
>> window than it should. Therefore the guest needs to be patched to
>> understand this new memory and so does QEMU.
>>
>> Instead of reimplementing here whatever solution we will choose for
>> the guest, this advertises the biggest possible window size limited by
>> 32 bit (as defined by LoPAPR).
>>
>> This seems to be safe as:
>> 1. The guest visible emulated table is allocated in KVM (actual pages
>> are allocated in page fault handler) and QEMU (actual pages are allocated
>> when changed);
>> 2. The hardware table (and corresponding userspace addresses cache)
>> supports sparse allocation and also checks for locked_vm limit so
>> it is unable to cause the host any damage.
>>
>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> 
> This seems like a good idea to me, even without the NPU stuff.  It
> always bothered me slightly that we based what's effectively the IOVA
> limit on the guest RAM size which doesn't have any direct connection
> to it.
> 
> As long as it doesn't hit the locked memory limits, I don't see any
> reason we should prevent a guest from doing something weird like
> mapping a small bit of RAM over and over in IOVA space, or mapping its
> RAM sparsely in IOVA space.
> 
>> ---
>>  hw/ppc/spapr_rtas_ddw.c | 19 +++----------------
>>  1 file changed, 3 insertions(+), 16 deletions(-)
>>
>> diff --git a/hw/ppc/spapr_rtas_ddw.c b/hw/ppc/spapr_rtas_ddw.c
>> index 329feb1..df60351 100644
>> --- a/hw/ppc/spapr_rtas_ddw.c
>> +++ b/hw/ppc/spapr_rtas_ddw.c
>> @@ -96,9 +96,8 @@ static void rtas_ibm_query_pe_dma_window(PowerPCCPU *cpu,
>>                                           uint32_t nret, target_ulong rets)
>>  {
>>      sPAPRPHBState *sphb;
>> -    uint64_t buid, max_window_size;
>> +    uint64_t buid;
>>      uint32_t avail, addr, pgmask = 0;
>> -    MachineState *machine = MACHINE(spapr);
>>  
>>      if ((nargs != 3) || (nret != 5)) {
>>          goto param_error_exit;
>> @@ -114,27 +113,15 @@ static void rtas_ibm_query_pe_dma_window(PowerPCCPU *cpu,
>>      /* Translate page mask to LoPAPR format */
>>      pgmask = spapr_page_mask_to_query_mask(sphb->page_size_mask);
>>  
>> -    /*
>> -     * This is "Largest contiguous block of TCEs allocated specifically
>> -     * for (that is, are reserved for) this PE".
>> -     * Return the maximum number as maximum supported RAM size was in 4K pages.
>> -     */
>> -    if (machine->ram_size == machine->maxram_size) {
>> -        max_window_size = machine->ram_size;
>> -    } else {
>> -        max_window_size = machine->device_memory->base +
>> -                          memory_region_size(&machine->device_memory->mr);
>> -    }
>> -
>>      avail = SPAPR_PCI_DMA_MAX_WINDOWS - spapr_phb_get_active_win_num(sphb);
>>  
>>      rtas_st(rets, 0, RTAS_OUT_SUCCESS);
>>      rtas_st(rets, 1, avail);
>> -    rtas_st(rets, 2, max_window_size >> SPAPR_TCE_PAGE_SHIFT);
>> +    rtas_st(rets, 2, 0xFFFFFFFF); /* Largest contiguous block of TCEs */
> 
> One detail though.. where does this limit actually come from?  Is it
> actually a limit imposed by the hardware somewhere, or just because
> the RTAS call doesn't have room for anything more?

I used this limit just because of the parameter size.

> 
> Previously limits would usually have been a power of 2; is it likely
> to matter that now it won't be?

LoPAPR calls it the "Largest contiguous block of TCEs" and makes no
mention of powers of two, but ibm,create-pe-dma-window takes a window
shift, so it is assumed that windows can still only be a power of two in
size. Not sure. The powernv code will fail if the requested window is
not a power of two.

Replacing 0xffffffff with 0x80000000 won't make a difference here,
though it will probably make the error clearer...
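A quick sketch backs this up (an illustration, not guest code; it assumes the guest rounds the advertised TCE count down to a power of two and converts it to the window shift that ibm,create-pe-dma-window takes):

```c
#include <assert.h>
#include <stdint.h>

/* Largest power-of-two window, expressed as the shift that
 * ibm,create-pe-dma-window expects, fitting within max_tces TCE
 * entries of (1 << page_shift) bytes each. */
static unsigned window_shift(uint64_t max_tces, unsigned page_shift)
{
    unsigned shift = 0;

    /* Grow the window while the next doubling still fits. */
    while ((1ULL << (shift + 1)) <= max_tces) {
        shift++;
    }
    return shift + page_shift;
}
```

Both 0xFFFFFFFF and 0x80000000 round down to 2^31 TCEs, i.e. a 43-bit (8 TiB) window with 4K pages, so the choice only changes how obvious the limit is to a reader.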




> 
>>      rtas_st(rets, 3, pgmask);
>>      rtas_st(rets, 4, 0); /* DMA migration mask, not supported */
>>  
>> -    trace_spapr_iommu_ddw_query(buid, addr, avail, max_window_size, pgmask);
>> +    trace_spapr_iommu_ddw_query(buid, addr, avail, 0xFFFFFFFF, pgmask);
>>      return;
>>  
>>  param_error_exit:
>
David Gibson Nov. 19, 2018, 5:31 a.m. UTC | #3
On Mon, Nov 19, 2018 at 04:08:33PM +1100, Alexey Kardashevskiy wrote:
> 
> 
> On 19/11/2018 13:42, David Gibson wrote:
> > On Tue, Nov 13, 2018 at 07:31:02PM +1100, Alexey Kardashevskiy wrote:
> >> When deciding about the huge DMA window, the typical Linux pseries guest
> >> uses the maximum allowed RAM size as the upper limit. We did the same
> >> on QEMU side to match that logic. Now we are going to support GPU RAM
> >> pass through which is not available at the guest boot time as it requires
> >> the guest driver interaction. As the result, the guest requests a smaller
> >> window than it should. Therefore the guest needs to be patched to
> >> understand this new memory and so does QEMU.
> >>
> >> Instead of reimplementing here whatever solution we will choose for
> >> the guest, this advertises the biggest possible window size limited by
> >> 32 bit (as defined by LoPAPR).
> >>
> >> This seems to be safe as:
> >> 1. The guest visible emulated table is allocated in KVM (actual pages
> >> are allocated in page fault handler) and QEMU (actual pages are allocated
> >> when changed);
> >> 2. The hardware table (and corresponding userspace addresses cache)
> >> supports sparse allocation and also checks for locked_vm limit so
> >> it is unable to cause the host any damage.
> >>
> >> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> > 
> > This seems like a good idea to me, even without the NPU stuff.  It
> > always bothered me slightly that we based what's effectively the IOVA
> > limit on the guest RAM size which doesn't have any direct connection
> > to it.
> > 
> > As long as it doesn't hit the locked memory limits, I don't see any
> > reason we should prevent a guest from doing something weird like
> > mapping a small bit of RAM over and over in IOVA space, or mapping its
> > RAM sparsely in IOVA space.
> > 
> >> ---
> >>  hw/ppc/spapr_rtas_ddw.c | 19 +++----------------
> >>  1 file changed, 3 insertions(+), 16 deletions(-)
> >>
> >> diff --git a/hw/ppc/spapr_rtas_ddw.c b/hw/ppc/spapr_rtas_ddw.c
> >> index 329feb1..df60351 100644
> >> --- a/hw/ppc/spapr_rtas_ddw.c
> >> +++ b/hw/ppc/spapr_rtas_ddw.c
> >> @@ -96,9 +96,8 @@ static void rtas_ibm_query_pe_dma_window(PowerPCCPU *cpu,
> >>                                           uint32_t nret, target_ulong rets)
> >>  {
> >>      sPAPRPHBState *sphb;
> >> -    uint64_t buid, max_window_size;
> >> +    uint64_t buid;
> >>      uint32_t avail, addr, pgmask = 0;
> >> -    MachineState *machine = MACHINE(spapr);
> >>  
> >>      if ((nargs != 3) || (nret != 5)) {
> >>          goto param_error_exit;
> >> @@ -114,27 +113,15 @@ static void rtas_ibm_query_pe_dma_window(PowerPCCPU *cpu,
> >>      /* Translate page mask to LoPAPR format */
> >>      pgmask = spapr_page_mask_to_query_mask(sphb->page_size_mask);
> >>  
> >> -    /*
> >> -     * This is "Largest contiguous block of TCEs allocated specifically
> >> -     * for (that is, are reserved for) this PE".
> >> -     * Return the maximum number as maximum supported RAM size was in 4K pages.
> >> -     */
> >> -    if (machine->ram_size == machine->maxram_size) {
> >> -        max_window_size = machine->ram_size;
> >> -    } else {
> >> -        max_window_size = machine->device_memory->base +
> >> -                          memory_region_size(&machine->device_memory->mr);
> >> -    }
> >> -
> >>      avail = SPAPR_PCI_DMA_MAX_WINDOWS - spapr_phb_get_active_win_num(sphb);
> >>  
> >>      rtas_st(rets, 0, RTAS_OUT_SUCCESS);
> >>      rtas_st(rets, 1, avail);
> >> -    rtas_st(rets, 2, max_window_size >> SPAPR_TCE_PAGE_SHIFT);
> >> +    rtas_st(rets, 2, 0xFFFFFFFF); /* Largest contiguous block of TCEs */
> > 
> > One detail though.. where does this limit actually come from?  Is it
> > actually a limit imposed by the hardware somewhere, or just because
> > the RTAS call doesn't have room for anything more?
> 
> I used this limit just because of the parameter size.
> 
> > 
> > Previously limits would usually have been a power of 2; is it likely
> > to matter that now it won't be?
> 
> LoPAPR calls it the "Largest contiguous block of TCEs" and makes no
> mention of powers of two, but ibm,create-pe-dma-window takes a window
> shift, so it is assumed that windows can still only be a power of two in
> size. Not sure. The powernv code will fail if the requested window is
> not a power of two.
> 
> Replacing 0xffffffff with 0x80000000 won't make a difference here,
> though it will probably make the error clearer...

Yeah, given that the windows have to be a power of 2 in size, I think
it makes sense for this to be a power of 2, even if it doesn't
strictly have to be.  So I think 0x80000000 would be a better option.
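The distinction between the two candidate values comes down to the usual single-bit test (a sketch for illustration only):

```c
#include <assert.h>
#include <stdint.h>

/* True iff x is a power of two, i.e. exactly one bit is set. */
static int is_pow2(uint32_t x)
{
    return x != 0 && (x & (x - 1)) == 0;
}
```

0x80000000 (2^31) passes this check while 0xFFFFFFFF does not, which is why it reads as the more natural limit for windows that must themselves be a power of two in size.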

Patch

diff --git a/hw/ppc/spapr_rtas_ddw.c b/hw/ppc/spapr_rtas_ddw.c
index 329feb1..df60351 100644
--- a/hw/ppc/spapr_rtas_ddw.c
+++ b/hw/ppc/spapr_rtas_ddw.c
@@ -96,9 +96,8 @@  static void rtas_ibm_query_pe_dma_window(PowerPCCPU *cpu,
                                          uint32_t nret, target_ulong rets)
 {
     sPAPRPHBState *sphb;
-    uint64_t buid, max_window_size;
+    uint64_t buid;
     uint32_t avail, addr, pgmask = 0;
-    MachineState *machine = MACHINE(spapr);
 
     if ((nargs != 3) || (nret != 5)) {
         goto param_error_exit;
@@ -114,27 +113,15 @@  static void rtas_ibm_query_pe_dma_window(PowerPCCPU *cpu,
     /* Translate page mask to LoPAPR format */
     pgmask = spapr_page_mask_to_query_mask(sphb->page_size_mask);
 
-    /*
-     * This is "Largest contiguous block of TCEs allocated specifically
-     * for (that is, are reserved for) this PE".
-     * Return the maximum number as maximum supported RAM size was in 4K pages.
-     */
-    if (machine->ram_size == machine->maxram_size) {
-        max_window_size = machine->ram_size;
-    } else {
-        max_window_size = machine->device_memory->base +
-                          memory_region_size(&machine->device_memory->mr);
-    }
-
     avail = SPAPR_PCI_DMA_MAX_WINDOWS - spapr_phb_get_active_win_num(sphb);
 
     rtas_st(rets, 0, RTAS_OUT_SUCCESS);
     rtas_st(rets, 1, avail);
-    rtas_st(rets, 2, max_window_size >> SPAPR_TCE_PAGE_SHIFT);
+    rtas_st(rets, 2, 0xFFFFFFFF); /* Largest contiguous block of TCEs */
     rtas_st(rets, 3, pgmask);
     rtas_st(rets, 4, 0); /* DMA migration mask, not supported */
 
-    trace_spapr_iommu_ddw_query(buid, addr, avail, max_window_size, pgmask);
+    trace_spapr_iommu_ddw_query(buid, addr, avail, 0xFFFFFFFF, pgmask);
     return;
 
 param_error_exit: