
[qemu,0/3] spapr_pci, vfio: NVIDIA V100 + P9 passthrough

Message ID 20190117025115.81178-1-aik@ozlabs.ru

Message

Alexey Kardashevskiy Jan. 17, 2019, 2:51 a.m. UTC
This is for passing through NVIDIA V100 GPUs on POWER9 systems.

This implements a subdriver for NVIDIA V100 GPU with coherent memory and
NPU/ATS support available in the POWER9 CPU.

Patch 1/3 is not strictly related, but since the new memory also needs to
be mapped into the 64bit DMA window, and it is located quite high in the
address space, some adjustments are needed.


This is based on dwg/ppc-for-4.0 sha1 a0a8bff and requires the headers
update from v5.0-rc1 that Paolo has already staged.

Please comment. Thanks.



Alexey Kardashevskiy (3):
  vfio/spapr: Fix indirect levels calculation
  vfio: Make vfio_get_region_info_cap public
  spapr: Support NVIDIA V100 GPU with NVLink2

 hw/vfio/pci.h                 |   2 +
 include/hw/pci-host/spapr.h   |   9 +
 include/hw/ppc/spapr.h        |   3 +-
 include/hw/vfio/vfio-common.h |   2 +
 hw/ppc/spapr.c                |  25 ++-
 hw/ppc/spapr_pci.c            | 333 +++++++++++++++++++++++++++++++++-
 hw/vfio/common.c              |   2 +-
 hw/vfio/pci-quirks.c          | 120 ++++++++++++
 hw/vfio/pci.c                 |  14 ++
 hw/vfio/spapr.c               |  38 +++-
 hw/vfio/trace-events          |   6 +-
 11 files changed, 539 insertions(+), 15 deletions(-)

Comments

Alexey Kardashevskiy Feb. 3, 2019, 11:59 p.m. UTC | #1
On 17/01/2019 13:51, Alexey Kardashevskiy wrote:
> [...]


Ping?

Daniel Henrique Barboza Feb. 6, 2019, 5:22 p.m. UTC | #2
Based on this series, I've sent a Libvirt patch to allow a QEMU process
to inherit IPC_LOCK when using VFIO passthrough with the Tesla V100
GPU:

https://www.redhat.com/archives/libvir-list/2019-February/msg00219.html


In that thread, Alex raised concerns about allowing QEMU to freely lock
all the memory it wants. Is this an issue to be considered in the review
of this series here?

Reading the patches, especially patch 3/3, it seems to me that QEMU is
going to lock the KVM memory to populate the NUMA node with memory
of the GPU itself, so at first there is no risk of taking over the
host RAM.
Am I missing something?


Thanks,


DHB


On 1/17/19 12:51 AM, Alexey Kardashevskiy wrote:
> [...]
Alexey Kardashevskiy Feb. 7, 2019, 4:43 a.m. UTC | #3
On 07/02/2019 04:22, Daniel Henrique Barboza wrote:
> Based on this series, I've sent a Libvirt patch to allow a QEMU process
> to inherit IPC_LOCK when using VFIO passthrough with the Tesla V100
> GPU:
> 
> https://www.redhat.com/archives/libvir-list/2019-February/msg00219.html
> 
> 
> In that thread, Alex raised concerns about allowing QEMU to freely lock
> all the memory it wants. Is this an issue to be considered in the review
> of this series here?
> 
> Reading the patches, especially patch 3/3, it seems to me that QEMU is
> going to lock the KVM memory to populate the NUMA node with memory
> of the GPU itself, so at first there is no risk of taking over the
> host RAM.
> Am I missing something?


The GPU memory belongs to the device: it is not visible to the host as
memory blocks and is not covered by page structs. For the host it is
more like MMIO, which is passed through to the guest without the
locked-memory accounting. I'd expect libvirt to keep working as usual,
except that:

when libvirt calculates the amount of memory needed for TCE tables
(which is guestRAM/64k*8), it now needs to use the end of the last GPU
RAM window as the guest RAM size. For example, in QEMU HMP "info mtree -f":

FlatView #2
 AS "memory", root: system
 AS "cpu-memory-0", root: system
 Root memory region: system
  0000000000000000-000000007fffffff (prio 0, ram): ppc_spapr.ram
  0000010000000000-0000011fffffffff (prio 0, ram): nvlink2-mr

So previously the DMA window would cover 0x7fffffff+1, now it has to
cover 0x11fffffffff+1.
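
To make that concrete, a back-of-the-envelope sketch (illustrative C
with made-up names, not libvirt or QEMU code) of the table sizes for
those two window ends:

/* Sketch: TCE table size for a DMA window covering a given end
 * address, assuming 64k IOMMU pages and 8-byte TCEs as above. */
#include <stdint.h>
#include <stdio.h>

static uint64_t tce_table_bytes(uint64_t window_size)
{
    return (window_size >> 16) * 8;  /* window/64k entries, 8 bytes each */
}

int main(void)
{
    uint64_t ram_only = 0x7fffffffULL + 1;     /* up to ppc_spapr.ram */
    uint64_t with_gpu = 0x11fffffffffULL + 1;  /* up to nvlink2-mr */

    printf("RAM only: %8llu KiB\n",
           (unsigned long long)(tce_table_bytes(ram_only) >> 10));
    printf("with GPU: %8llu KiB\n",
           (unsigned long long)(tce_table_bytes(with_gpu) >> 10));
    return 0;
}

That is 256 KiB vs 144 MiB for a flat table; with multi-level
("indirect") tables the sparse gap between guest RAM and the GPU window
need not be fully backed, which is presumably why the per-GPU estimate
later in the thread is only 16MB.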


Alex Williamson Feb. 7, 2019, 3:18 p.m. UTC | #4
On Thu, 7 Feb 2019 15:43:18 +1100
Alexey Kardashevskiy <aik@ozlabs.ru> wrote:

> [...]
> > when libvirt calculates the amount of memory needed for TCE tables
> > (which is guestRAM/64k*8), it now needs to use the end of the last GPU
> > RAM window as the guest RAM size. For example, in QEMU HMP "info mtree -f":
> 
> FlatView #2
>  AS "memory", root: system
>  AS "cpu-memory-0", root: system
>  Root memory region: system
>   0000000000000000-000000007fffffff (prio 0, ram): ppc_spapr.ram
>   0000010000000000-0000011fffffffff (prio 0, ram): nvlink2-mr
> 
> So previously the DMA window would cover 0x7fffffff+1, now it has to
> cover 0x11fffffffff+1.

This looks like a chicken and egg problem, you're saying libvirt needs
to query mtree to understand the extent of the GPU layout, but we need
to specify the locked memory limits in order for QEMU to start?  Is
libvirt supposed to start the VM with unlimited locked memory and fix
it at some indeterminate point in the future?  Run a dummy VM with
unlimited locked memory in order to determine the limits for the real
VM?  Neither of these sounds practical.  Thanks,

Alex
Alexey Kardashevskiy Feb. 8, 2019, 2:29 a.m. UTC | #5
On 08/02/2019 02:18, Alex Williamson wrote:
> [...]
> This looks like a chicken and egg problem, you're saying libvirt needs
> to query mtree to understand the extent of the GPU layout, but we need
> to specify the locked memory limits in order for QEMU to start?  Is
> libvirt supposed to start the VM with unlimited locked memory and fix
> it at some indeterminate point in the future?  Run a dummy VM with
> unlimited locked memory in order to determine the limits for the real
> VM?  Neither of these sounds practical.  Thanks,


QEMU maps GPU RAM at known locations (which depend only on the vPHB's
index, or can be set explicitly) and libvirt knows how many GPUs are
passed, so it is quite easy to calculate the required amount of memory.

Here is the window start calculation:
https://github.com/aik/qemu/commit/7073cad3ae7708d657e01672bcf53092808b54fb#diff-662409c2a5a150fe231d07ea8384b920R3812

We do not know the exact GPU RAM window size until QEMU reads it from
VFIO/nvlink2, but we know that all existing hardware has a window of
128GB (the adapters I have access to only have 16/32GB on board).
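
For illustration, the placement scheme that makes this estimate
possible could be sketched as below; the 1 TiB base and 128 GiB stride
are assumptions consistent with the mtree dump earlier in the thread,
not the exact QEMU macros:

/* Sketch: deterministic GPU RAM window placement per vPHB. */
#include <stdint.h>
#include <stdio.h>

#define GPU_WIN_BASE 0x10000000000ULL  /* 1 TiB, where nvlink2-mr appears */
#define GPU_WIN_SIZE 0x2000000000ULL   /* 128 GiB per GPU RAM window */

static uint64_t gpu_win_start(uint32_t vphb_index)
{
    return GPU_WIN_BASE + (uint64_t)vphb_index * GPU_WIN_SIZE;
}

int main(void)
{
    for (uint32_t i = 0; i < 3; i++) {
        printf("vPHB %u: 0x%llx..0x%llx\n", i,
               (unsigned long long)gpu_win_start(i),
               (unsigned long long)(gpu_win_start(i) + GPU_WIN_SIZE - 1));
    }
    return 0;
}

For vPHB 0 this reproduces the 0x10000000000..0x11fffffffff range shown
in the mtree dump.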
Alex Williamson Feb. 8, 2019, 3:26 a.m. UTC | #6
On Fri, 8 Feb 2019 13:29:37 +1100
Alexey Kardashevskiy <aik@ozlabs.ru> wrote:

> [...]
> QEMU maps GPU RAM at known locations (which depend only on the vPHB's
> index, or can be set explicitly) and libvirt knows how many GPUs are
> passed, so it is quite easy to calculate the required amount of memory.
> 
> Here is the window start calculation:
> https://github.com/aik/qemu/commit/7073cad3ae7708d657e01672bcf53092808b54fb#diff-662409c2a5a150fe231d07ea8384b920R3812
> 
> We do not know the exact GPU RAM window size until QEMU reads it from
> VFIO/nvlink2, but we know that all existing hardware has a window of
> 128GB (the adapters I have access to only have 16/32GB on board).

So you're asking that libvirt add 128GB per GPU with magic nvlink
properties, which may be 8x what's actually necessary and libvirt
determines which GPUs to apply this to how?  Does libvirt need to sort
through device tree properties for this?  Thanks,

Alex
David Gibson Feb. 8, 2019, 5:28 a.m. UTC | #7
On Thu, Feb 07, 2019 at 08:26:20PM -0700, Alex Williamson wrote:
> [...]
> So you're asking that libvirt add 128GB per GPU with magic nvlink
> properties, which may be 8x what's actually necessary and libvirt
> determines which GPUs to apply this to how?  Does libvirt need to sort
> through device tree properties for this?  Thanks,

Hm.  If the GPU memory is really separate from main RAM, which it
sounds like, I don't think it makes sense to account it against the
same locked memory limit as regular RAM.
Alex Williamson Feb. 8, 2019, 3:52 p.m. UTC | #8
On Fri, 8 Feb 2019 16:28:50 +1100
David Gibson <david@gibson.dropbear.id.au> wrote:

> [...]
> Hm.  If the GPU memory is really separate from main RAM, which it
> sounds like, I don't think it makes sense to account it against the
> same locked memory limit as regular RAM.

That's true, if the user owns the device and the device provides the
memory backing, it doesn't seem like it should count against the user's
locked memory limit.  That'd make things easy for libvirt.  Thanks,

Alex
Daniel Henrique Barboza Feb. 8, 2019, 4:25 p.m. UTC | #9
On 2/8/19 1:52 PM, Alex Williamson wrote:
>> Hm.  If the GPU memory is really separate from main RAM, which it
>> sounds like, I don't think it makes sense to account it against the
>> same locked memory limit as regular RAM.
> That's true, if the user owns the device and the device provides the
> memory backing, it doesn't seem like it should count against the user's
> locked memory limit.  That'd make things easy for libvirt.  Thanks,

Sounds good.

Is this a QEMU (or even kernel) side change? If so, and we choose to go
through with it, I'll wait for a new spin of this series to see what must
be done in Libvirt to properly support it.


Thanks,


DHB
Alexey Kardashevskiy Feb. 11, 2019, 3:49 a.m. UTC | #10
On 08/02/2019 16:28, David Gibson wrote:
> [...]
> Hm.  If the GPU memory is really separate from main RAM, which it
> sounds like, I don't think it makes sense to account it against the
> same locked memory limit as regular RAM.


This is accounting for the TCE table that covers the GPU RAM, not for the GPU RAM itself.

So I am asking libvirt to add 128GB/64k*8=16MB to the locked_vm. It
already does so for the guest RAM.
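
Spelled out, per 128GB GPU RAM window:

  128GB / 64k = 2^37 / 2^16 = 2^21 TCEs
  2^21 TCEs * 8 bytes/TCE = 2^24 bytes = 16MB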
Alex Williamson Feb. 11, 2019, 6:07 a.m. UTC | #11
On Mon, 11 Feb 2019 14:49:49 +1100
Alexey Kardashevskiy <aik@ozlabs.ru> wrote:

> [...]
> > Hm.  If the GPU memory is really separate from main RAM, which it
> > sounds like, I don't think it makes sense to account it against the
> > same locked memory limit as regular RAM.  
> 
> 
> This is accounting for the TCE table that covers the GPU RAM, not for the GPU RAM itself.
> 
> So I am asking libvirt to add 128GB/64k*8=16MB to the locked_vm. It
> already does so for the guest RAM.

Why do host internal data structures count against the user's locked
memory limit?  We don't include IOMMU page tables or type1 accounting
structures on other archs.  Thanks,

Alex
Alexey Kardashevskiy Feb. 11, 2019, 7:46 a.m. UTC | #12
On 11/02/2019 17:07, Alex Williamson wrote:
> [...]
> Why do host internal data structures count against the user's locked
> memory limit?  We don't include IOMMU page tables or type1 accounting
> structures on other archs.  Thanks,


Because pseries guests create DMA windows dynamically, and the userspace
can pass multiple devices to a guest, placing each on its own vPHB. Each
vPHB will most likely create an additional 64bit DMA window, backed by an
IOMMU table, so it is the userspace that triggers these allocations. We
account guest RAM once, as it is shared among vPHBs, but not the IOMMU
tables.
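
As a sketch of that accounting model (illustrative names, not the
kernel's actual code):

/* Sketch: locked_vm accounting as described above -- guest RAM is
 * counted once, while each vPHB's extra 64bit DMA window adds its
 * own IOMMU (TCE) table. */
#include <stdint.h>
#include <stdio.h>

static uint64_t tce_table_bytes(uint64_t window_size)
{
    return (window_size >> 16) * 8;   /* 64k pages, 8-byte TCEs */
}

static uint64_t locked_vm_estimate(uint64_t guest_ram, unsigned nr_vphbs,
                                   uint64_t win64_size)
{
    uint64_t total = guest_ram;               /* shared, counted once */

    for (unsigned i = 0; i < nr_vphbs; i++) { /* one table per window */
        total += tce_table_bytes(win64_size);
    }
    return total;
}

int main(void)
{
    /* 2GB guest, 2 vPHBs, each with a 128GB 64bit DMA window. */
    printf("%llu MiB\n", (unsigned long long)
           (locked_vm_estimate(2ULL << 30, 2, 128ULL << 30) >> 20));
    return 0;
}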
David Gibson Feb. 14, 2019, 4:59 a.m. UTC | #13
On Mon, Feb 11, 2019 at 02:49:49PM +1100, Alexey Kardashevskiy wrote:
> [...]
> This is accounting for the TCE table that covers the GPU RAM, not for the GPU RAM itself.

Ah, ok, that makes sense then

> So I am asking libvirt to add 128GB/64k*8=16MB to the locked_vm. It
> already does so for the guest RAM.

That seems reasonable.  IIRC we already have some slop in the amount
of locked vm that libvirt allocates; not sure if it'll be enough as
is.
David Gibson Feb. 14, 2019, 5:02 a.m. UTC | #14
On Mon, Feb 11, 2019 at 06:46:32PM +1100, Alexey Kardashevskiy wrote:
> [...]
> > 
> > Why do host internal data structures count against the user's locked
> > memory limit?  We don't include IOMMU page tables or type1 accounting
> > structures on other archs.  Thanks,
> 
> Because pseries guests create DMA windows dynamically, and the userspace
> can pass multiple devices to a guest, placing each on its own vPHB. Each
> vPHB will most likely create an additional 64bit DMA window, backed by an
> IOMMU table, so it is the userspace that triggers these allocations. We
> account guest RAM once, as it is shared among vPHBs, but not the IOMMU
> tables.

Uh.. I think that's missing the point.

The real reason is that on x86, IOMMU page tables (IIUC) live within
guest memory, so they're essentially already accounted for.

Under PAPR, the IOMMU tables live outside the guest and are accessed
via hypercalls.  So they consume hypervisor memory that the guest can
cause to be allocated.  We need to account for that somewhere, so the
guest can't cause the hypervisor to allocate arbitrary amounts of
space.
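
A minimal sketch of that contrast (H_PUT_TCE is the real PAPR hypercall
name, but the stubbed hypervisor side and the table layout here are
purely illustrative):

/* Sketch: under PAPR the guest programs DMA mappings via hypercall,
 * so the TCE table lives on the hypervisor side rather than in
 * guest RAM as x86-style IOMMU page tables do. */
#include <stdint.h>
#include <stdio.h>

#define TCE_PAGE_SHIFT 16
#define TABLE_ENTRIES  64

/* "Hypervisor" side: this allocation is what the locked_vm
 * accounting discussed above has to cover. */
static uint64_t tce_table[TABLE_ENTRIES];

static long h_put_tce(uint64_t liobn, uint64_t ioba, uint64_t tce)
{
    uint64_t idx = ioba >> TCE_PAGE_SHIFT;

    (void)liobn;            /* window id; unused in this toy */
    if (idx >= TABLE_ENTRIES)
        return -1;          /* roughly, H_PARAMETER */
    tce_table[idx] = tce;   /* the guest never touches this memory */
    return 0;
}

int main(void)
{
    /* "Guest" side: map one 64k IOMMU page at I/O address 0. */
    h_put_tce(0x80000000, 0, 0x100000 | 0x3 /* read|write, illustrative */);
    printf("tce[0] = 0x%llx\n", (unsigned long long)tce_table[0]);
    return 0;
}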