
[v2,3/3] qom: Link multiple numa nodes to device using a new object

Message ID 20231007201740.30335-4-ankita@nvidia.com
State New
Series [v2,1/3] qom: new object to associate device to numa node

Commit Message

Ankit Agrawal Oct. 7, 2023, 8:17 p.m. UTC
From: Ankit Agrawal <ankita@nvidia.com>

NVIDIA GPUs support the MIG (Multi-Instance GPU) feature [1], which allows
partitioning of the GPU device resources (including device memory) into
several (up to 8) isolated instances. Each partitioned memory region needs
a dedicated NUMA node to operate. The partitions are not fixed and can be
created/deleted at runtime.

Unfortunately, the Linux OS does not provide a means to dynamically
create/destroy NUMA nodes, and implementing such a feature is not expected to
be trivial. The nodes that the OS discovers at boot time while parsing SRAT
remain fixed. So we utilize the GI Affinity structures, which allow an
association between nodes and devices. Multiple GI structures per BDF are
possible, allowing creation of multiple nodes by exposing a unique PXM in each
of these structures.
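For reference, each such association is encoded as an SRAT Generic Initiator
Affinity structure, a fixed 32-byte record (ACPI 6.3, structure type 5). The
following is a minimal, hypothetical standalone sketch of emitting one record
per node for a single BDF; it is illustrative only, not QEMU's actual
build_srat_generic_initiator() code, and the field packing assumes a
little-endian host:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* ACPI 6.3 SRAT Generic Initiator Affinity structure (type 5, 32 bytes) */
typedef struct __attribute__((packed)) {
    uint8_t  type;               /* 5 = Generic Initiator Affinity */
    uint8_t  length;             /* always 32 */
    uint8_t  reserved1;
    uint8_t  device_handle_type; /* 1 = PCI device handle */
    uint32_t proximity_domain;   /* PXM, i.e. the NUMA node */
    uint8_t  device_handle[16];  /* segment(2) + BDF(2) + reserved(12) */
    uint32_t flags;              /* bit 0: enabled */
    uint32_t reserved2;
} SratGiAffinity;

/*
 * Emit one GI structure per NUMA node, all pointing at the same BDF.
 * Returns the number of bytes written to 'out'.
 */
static int build_gi_entries(SratGiAffinity *out, uint16_t segment,
                            uint16_t bdf, uint32_t node_start, uint32_t count)
{
    for (uint32_t i = 0; i < count; i++) {
        SratGiAffinity *gi = &out[i];

        memset(gi, 0, sizeof(*gi));
        gi->type = 5;
        gi->length = sizeof(*gi);
        gi->device_handle_type = 1;            /* PCI */
        gi->proximity_domain = node_start + i; /* unique PXM per structure */
        memcpy(&gi->device_handle[0], &segment, 2); /* little-endian host */
        memcpy(&gi->device_handle[2], &bdf, 2);
        gi->flags = 1;                         /* enabled */
    }
    return count * sizeof(SratGiAffinity);
}
```

For the sample below (host 0009:01:00.0, nodes 2..9), this would produce eight
32-byte records with PXM 2 through 9, all carrying segment 0x0009 and BDF
0x0100.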

Introduce a new nvidia-acpi-generic-initiator object, which inherits from
the generic acpi-generic-initiator object, to allow a BDF to be associated
with more than one node.

An admin can provide the range of nodes using numa-node-start and
numa-node-count and link it to a device by providing its id. The following
sample creates 8 nodes and links them to the device dev0:

        -numa node,nodeid=2 \
        -numa node,nodeid=3 \
        -numa node,nodeid=4 \
        -numa node,nodeid=5 \
        -numa node,nodeid=6 \
        -numa node,nodeid=7 \
        -numa node,nodeid=8 \
        -numa node,nodeid=9 \
        -device vfio-pci-nohotplug,host=0009:01:00.0,bus=pcie.0,addr=04.0,rombar=0,id=dev0 \
        -object nvidia-acpi-generic-initiator,id=gi0,device=dev0,numa-node-start=2,numa-node-count=8 \

[1] https://www.nvidia.com/en-in/technologies/multi-instance-gpu

Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
---
 hw/acpi/acpi-generic-initiator.c         | 61 ++++++++++++++++++++++++
 include/hw/acpi/acpi-generic-initiator.h | 12 +++++
 qapi/qom.json                            | 24 +++++++++-
 3 files changed, 95 insertions(+), 2 deletions(-)

Comments

David Hildenbrand Oct. 9, 2023, 12:57 p.m. UTC | #1
On 09.10.23 14:30, Jonathan Cameron wrote:
> On Sun, 8 Oct 2023 01:47:40 +0530
> <ankita@nvidia.com> wrote:
> 
>> From: Ankit Agrawal <ankita@nvidia.com>
>>
>> NVIDIA GPU's support MIG (Mult-Instance GPUs) feature [1], which allows
>> partitioning of the GPU device resources (including device memory) into
>> several (upto 8) isolated instances. Each of the partitioned memory needs
>> a dedicated NUMA node to operate. The partitions are not fixed and they
>> can be created/deleted at runtime.
>>
>> Unfortunately Linux OS does not provide a means to dynamically create/destroy
>> NUMA nodes and such feature implementation is not expected to be trivial. The
>> nodes that OS discovers at the boot time while parsing SRAT remains fixed. So
>> we utilize the GI Affinity structures that allows association between nodes
>> and devices. Multiple GI structures per BDF is possible, allowing creation of
>> multiple nodes by exposing unique PXM in each of these structures.
>>
>> Introducing a new nvidia-acpi-generic-initiator object, which inherits from
>> the generic acpi-generic-initiator object to allow a BDF to be associated with
>> more than 1 nodes.
>>
>> An admin can provide the range of nodes using numa-node-start and
>> numa-node-count and link it to a device by providing its id. The following
>> sample creates 8 nodes and link them to the device dev0:
>>
>>          -numa node,nodeid=2 \
>>          -numa node,nodeid=3 \
>>          -numa node,nodeid=4 \
>>          -numa node,nodeid=5 \
>>          -numa node,nodeid=6 \
>>          -numa node,nodeid=7 \
>>          -numa node,nodeid=8 \
>>          -numa node,nodeid=9 \
>>          -device vfio-pci-nohotplug,host=0009:01:00.0,bus=pcie.0,addr=04.0,rombar=0,id=dev0 \
>>          -object nvidia-acpi-generic-initiator,id=gi0,device=dev0,numa-node-start=2,numa-node-count=8 \
> 
> If you go this way, use an array of references to the numa nodes instead of a start and number.
> There is no obvious reason why they should be contiguous that I can see.

Right, a uint16List should do.
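A hypothetical sketch of what the list-based property could look like in
qapi/qom.json and on the command line (illustrative only; the property name,
type, and syntax here are assumptions for discussion, not a merged interface):

```
{ 'struct': 'AcpiGenericInitiatorProperties',
  'data': { 'device': 'str',
            'node': ['uint16'] } }
```

On the command line this could then be passed via the JSON form of -object,
e.g.:

```
-object '{"qom-type":"acpi-generic-initiator","id":"gi0",
          "device":"dev0","node":[2,3,4,5,6,7,8,9]}'
```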
Alex Williamson Oct. 9, 2023, 9:16 p.m. UTC | #2
On Sun, 8 Oct 2023 01:47:40 +0530
<ankita@nvidia.com> wrote:

> From: Ankit Agrawal <ankita@nvidia.com>
> 
> NVIDIA GPU's support MIG (Mult-Instance GPUs) feature [1], which allows
> partitioning of the GPU device resources (including device memory) into
> several (upto 8) isolated instances. Each of the partitioned memory needs
> a dedicated NUMA node to operate. The partitions are not fixed and they
> can be created/deleted at runtime.
> 
> Unfortunately Linux OS does not provide a means to dynamically create/destroy
> NUMA nodes and such feature implementation is not expected to be trivial. The
> nodes that OS discovers at the boot time while parsing SRAT remains fixed. So
> we utilize the GI Affinity structures that allows association between nodes
> and devices. Multiple GI structures per BDF is possible, allowing creation of
> multiple nodes by exposing unique PXM in each of these structures.
> 
> Introducing a new nvidia-acpi-generic-initiator object, which inherits from
> the generic acpi-generic-initiator object to allow a BDF to be associated with
> more than 1 nodes.
> 
> An admin can provide the range of nodes using numa-node-start and
> numa-node-count and link it to a device by providing its id. The following
> sample creates 8 nodes and link them to the device dev0:
> 
>         -numa node,nodeid=2 \
>         -numa node,nodeid=3 \
>         -numa node,nodeid=4 \
>         -numa node,nodeid=5 \
>         -numa node,nodeid=6 \
>         -numa node,nodeid=7 \
>         -numa node,nodeid=8 \
>         -numa node,nodeid=9 \
>         -device vfio-pci-nohotplug,host=0009:01:00.0,bus=pcie.0,addr=04.0,rombar=0,id=dev0 \
>         -object nvidia-acpi-generic-initiator,id=gi0,device=dev0,numa-node-start=2,numa-node-count=8 \

Why didn't we just implement start and count in the base object (or a
list)? It seems like this gives the nvidia-acpi-generic-initiator two
different ways to set gi->node, either node= of the parent or
numa-node-start= here.  Once we expose the implicit node count in the
base object, I'm not sure of the purpose of this object.  I would have
thought it was for keying the build of the NVIDIA-specific _DSD, but that's
not implemented in this version.

I also don't see any programmatic means for management tools to know how
many nodes to create.  For example, what happens if there's a MIGv2 that
supports 16 partitions by default and makes use of the same vfio-pci
variant driver?  Thanks,

Alex

> 
> [1] https://www.nvidia.com/en-in/technologies/multi-instance-gpu
> 
> Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
> ---
>  hw/acpi/acpi-generic-initiator.c         | 61 ++++++++++++++++++++++++
>  include/hw/acpi/acpi-generic-initiator.h | 12 +++++
>  qapi/qom.json                            | 24 +++++++++-
>  3 files changed, 95 insertions(+), 2 deletions(-)
> 
> diff --git a/hw/acpi/acpi-generic-initiator.c b/hw/acpi/acpi-generic-initiator.c
> index 1ae79639be..8ef887c3a4 100644
> --- a/hw/acpi/acpi-generic-initiator.c
> +++ b/hw/acpi/acpi-generic-initiator.c
> @@ -150,3 +150,64 @@ void build_srat_generic_initiator(GArray *table_data)
>      }
>      g_slist_free(list);
>  }
> +
> +static void
> +nvidia_acpi_generic_initiator_set_node_start(Object *obj, Visitor *v,
> +                                             const char *name, void *opaque,
> +                                             Error **errp)
> +{
> +    AcpiGenericInitiator *gi = ACPI_GENERIC_INITIATOR(obj);
> +    uint32_t value;
> +
> +    if (!visit_type_uint32(v, name, &value, errp)) {
> +        return;
> +    }
> +
> +    if (value >= MAX_NODES) {
> +        return;
> +    }
> +
> +    gi->node = value;
> +}
> +
> +static void
> +nvidia_acpi_generic_initiator_set_node_count(Object *obj, Visitor *v,
> +                                             const char *name, void *opaque,
> +                                             Error **errp)
> +{
> +    AcpiGenericInitiator *gi = ACPI_GENERIC_INITIATOR(obj);
> +    uint32_t value;
> +
> +    if (!visit_type_uint32(v, name, &value, errp)) {
> +        return;
> +    }
> +
> +    gi->node_count = value;
> +}
> +
> +static void nvidia_acpi_generic_initiator_class_init(ObjectClass *oc, void *data)
> +{
> +    object_class_property_add(oc, NVIDIA_ACPI_GENERIC_INITIATOR_NODE_START_PROP,
> +                              "uint32", NULL,
> +                              nvidia_acpi_generic_initiator_set_node_start,
> +                              NULL, NULL);
> +    object_class_property_add(oc, NVIDIA_ACPI_GENERIC_INITIATOR_NODE_COUNT_PROP,
> +                              "uint32", NULL,
> +                              nvidia_acpi_generic_initiator_set_node_count,
> +                              NULL, NULL);
> +}
> +
> +static const TypeInfo nvidia_acpi_generic_initiator_info = {
> +    .parent = TYPE_ACPI_GENERIC_INITIATOR,
> +    .name = TYPE_NVIDIA_ACPI_GENERIC_INITIATOR,
> +    .instance_size = sizeof(NvidiaAcpiGenericInitiator),
> +    .class_size = sizeof(NvidiaAcpiGenericInitiatorClass),
> +    .class_init = nvidia_acpi_generic_initiator_class_init,
> +};
> +
> +static void
> +nvidia_acpi_generic_initiator_register_types(void)
> +{
> +    type_register_static(&nvidia_acpi_generic_initiator_info);
> +}
> +type_init(nvidia_acpi_generic_initiator_register_types);
> diff --git a/include/hw/acpi/acpi-generic-initiator.h b/include/hw/acpi/acpi-generic-initiator.h
> index e8e2670309..3e4cf42064 100644
> --- a/include/hw/acpi/acpi-generic-initiator.h
> +++ b/include/hw/acpi/acpi-generic-initiator.h
> @@ -9,10 +9,14 @@
>  #include "qom/object_interfaces.h"
>  
>  #define TYPE_ACPI_GENERIC_INITIATOR "acpi-generic-initiator"
> +#define TYPE_NVIDIA_ACPI_GENERIC_INITIATOR "nvidia-acpi-generic-initiator"
>  
>  #define ACPI_GENERIC_INITIATOR_DEVICE_PROP "device"
>  #define ACPI_GENERIC_INITIATOR_NODE_PROP "node"
>  
> +#define NVIDIA_ACPI_GENERIC_INITIATOR_NODE_START_PROP "numa-node-start"
> +#define NVIDIA_ACPI_GENERIC_INITIATOR_NODE_COUNT_PROP "numa-node-count"
> +
>  typedef struct AcpiGenericInitiator {
>      /* private */
>      Object parent;
> @@ -47,6 +51,14 @@ typedef struct PCIDeviceHandle {
>      uint64_t res1;
>  } PCIDeviceHandle;
>  
> +typedef struct NvidiaAcpiGenericInitiator {
> +    AcpiGenericInitiator parent;
> +} NvidiaAcpiGenericInitiator;
> +
> +typedef struct NvidiaAcpiGenericInitiatorClass {
> +        AcpiGenericInitiatorClass parent_class;
> +} NvidiaAcpiGenericInitiatorClass;
> +
>  void build_srat_generic_initiator(GArray *table_data);
>  
>  #endif
> diff --git a/qapi/qom.json b/qapi/qom.json
> index 86c87a161c..c29ad1388d 100644
> --- a/qapi/qom.json
> +++ b/qapi/qom.json
> @@ -793,6 +793,24 @@
>  { 'struct': 'AcpiGenericInitiatorProperties',
>    'data': { 'device': 'str', 'node': 'uint32' } }
>  
> +##
> +# @NvidiaAcpiGenericInitiatorProperties:
> +#
> +# Properties for nvidia-acpi-generic-initiator objects.
> +#
> +# @device: the ID of the device to be associated with the nodes
> +#
> +# @numa-node-start: the ID of the first numa node
> +#
> +# @numa-node-count: count of the numa nodes associated with the device
> +#
> +# Since: 8.0
> +##
> +{ 'struct': 'NvidiaAcpiGenericInitiatorProperties',
> +  'data': { 'device': 'str',
> +            'numa-node-start': 'uint32',
> +            'numa-node-count': 'uint32' } }
> +
>  ##
>  # @RngProperties:
>  #
> @@ -962,7 +980,8 @@
>      'tls-cipher-suites',
>      { 'name': 'x-remote-object', 'features': [ 'unstable' ] },
>      { 'name': 'x-vfio-user-server', 'features': [ 'unstable' ] },
> -    'acpi-generic-initiator'
> +    'acpi-generic-initiator',
> +    'nvidia-acpi-generic-initiator'
>    ] }
>  
>  ##
> @@ -1030,7 +1049,8 @@
>        'tls-cipher-suites':          'TlsCredsProperties',
>        'x-remote-object':            'RemoteObjectProperties',
>        'x-vfio-user-server':         'VfioUserServerProperties',
> -      'acpi-generic-initiator':     'AcpiGenericInitiatorProperties'
> +      'acpi-generic-initiator':     'AcpiGenericInitiatorProperties',
> +      'nvidia-acpi-generic-initiator':     'NvidiaAcpiGenericInitiatorProperties'
>    } }
>  
>  ##
Alex Williamson Oct. 9, 2023, 9:27 p.m. UTC | #3
On Mon, 9 Oct 2023 13:30:48 +0100
Jonathan Cameron <Jonathan.Cameron@Huawei.com> wrote:

> On Sun, 8 Oct 2023 01:47:40 +0530
> <ankita@nvidia.com> wrote:
> 
> > From: Ankit Agrawal <ankita@nvidia.com>
> > 
> > NVIDIA GPU's support MIG (Mult-Instance GPUs) feature [1], which allows
> > partitioning of the GPU device resources (including device memory) into
> > several (upto 8) isolated instances. Each of the partitioned memory needs
> > a dedicated NUMA node to operate. The partitions are not fixed and they
> > can be created/deleted at runtime.
> > 
> > Unfortunately Linux OS does not provide a means to dynamically create/destroy
> > NUMA nodes and such feature implementation is not expected to be trivial. The
> > nodes that OS discovers at the boot time while parsing SRAT remains fixed. So
> > we utilize the GI Affinity structures that allows association between nodes
> > and devices. Multiple GI structures per BDF is possible, allowing creation of
> > multiple nodes by exposing unique PXM in each of these structures.
> > 
> > Introducing a new nvidia-acpi-generic-initiator object, which inherits from
> > the generic acpi-generic-initiator object to allow a BDF to be associated with
> > more than 1 nodes.
> > 
> > An admin can provide the range of nodes using numa-node-start and
> > numa-node-count and link it to a device by providing its id. The following
> > sample creates 8 nodes and link them to the device dev0:
> > 
> >         -numa node,nodeid=2 \
> >         -numa node,nodeid=3 \
> >         -numa node,nodeid=4 \
> >         -numa node,nodeid=5 \
> >         -numa node,nodeid=6 \
> >         -numa node,nodeid=7 \
> >         -numa node,nodeid=8 \
> >         -numa node,nodeid=9 \
> >         -device vfio-pci-nohotplug,host=0009:01:00.0,bus=pcie.0,addr=04.0,rombar=0,id=dev0 \
> >         -object nvidia-acpi-generic-initiator,id=gi0,device=dev0,numa-node-start=2,numa-node-count=8 \  
> 
> If you go this way, use an array of references to the numa nodes instead of a start and number.
> There is no obvious reason why they should be contiguous that I can see.

Yup, I was looking for other places we allow a list syntax, I only
found fds=a:b:...:z, which is also used for vhostfds=.  I didn't find a
property beyond the string type to hold this though.

> I think it is simpler the other way around though - so have the numa nodes point at the
> vfio-pci-nohotplug device. 

Do you have a syntax you'd propose for this?  I'm having trouble seeing
how it makes things simpler.  Thanks,

Alex

> > [1] https://www.nvidia.com/en-in/technologies/multi-instance-gpu
> > 
> > Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
> > ---
> >  hw/acpi/acpi-generic-initiator.c         | 61 ++++++++++++++++++++++++
> >  include/hw/acpi/acpi-generic-initiator.h | 12 +++++
> >  qapi/qom.json                            | 24 +++++++++-
> >  3 files changed, 95 insertions(+), 2 deletions(-)
> > 
> > diff --git a/hw/acpi/acpi-generic-initiator.c b/hw/acpi/acpi-generic-initiator.c
> > index 1ae79639be..8ef887c3a4 100644
> > --- a/hw/acpi/acpi-generic-initiator.c
> > +++ b/hw/acpi/acpi-generic-initiator.c
> > @@ -150,3 +150,64 @@ void build_srat_generic_initiator(GArray *table_data)
> >      }
> >      g_slist_free(list);
> >  }
> > +
> > +static void
> > +nvidia_acpi_generic_initiator_set_node_start(Object *obj, Visitor *v,
> > +                                             const char *name, void *opaque,
> > +                                             Error **errp)
> > +{
> > +    AcpiGenericInitiator *gi = ACPI_GENERIC_INITIATOR(obj);
> > +    uint32_t value;
> > +
> > +    if (!visit_type_uint32(v, name, &value, errp)) {
> > +        return;
> > +    }
> > +
> > +    if (value >= MAX_NODES) {
> > +        return;
> > +    }
> > +
> > +    gi->node = value;
> > +}
> > +
> > +static void
> > +nvidia_acpi_generic_initiator_set_node_count(Object *obj, Visitor *v,
> > +                                             const char *name, void *opaque,
> > +                                             Error **errp)
> > +{
> > +    AcpiGenericInitiator *gi = ACPI_GENERIC_INITIATOR(obj);
> > +    uint32_t value;
> > +
> > +    if (!visit_type_uint32(v, name, &value, errp)) {
> > +        return;
> > +    }
> > +
> > +    gi->node_count = value;
> > +}
> > +
> > +static void nvidia_acpi_generic_initiator_class_init(ObjectClass *oc, void *data)
> > +{
> > +    object_class_property_add(oc, NVIDIA_ACPI_GENERIC_INITIATOR_NODE_START_PROP,
> > +                              "uint32", NULL,
> > +                              nvidia_acpi_generic_initiator_set_node_start,
> > +                              NULL, NULL);
> > +    object_class_property_add(oc, NVIDIA_ACPI_GENERIC_INITIATOR_NODE_COUNT_PROP,
> > +                              "uint32", NULL,
> > +                              nvidia_acpi_generic_initiator_set_node_count,
> > +                              NULL, NULL);
> > +}
> > +
> > +static const TypeInfo nvidia_acpi_generic_initiator_info = {
> > +    .parent = TYPE_ACPI_GENERIC_INITIATOR,
> > +    .name = TYPE_NVIDIA_ACPI_GENERIC_INITIATOR,
> > +    .instance_size = sizeof(NvidiaAcpiGenericInitiator),
> > +    .class_size = sizeof(NvidiaAcpiGenericInitiatorClass),
> > +    .class_init = nvidia_acpi_generic_initiator_class_init,
> > +};
> > +
> > +static void
> > +nvidia_acpi_generic_initiator_register_types(void)
> > +{
> > +    type_register_static(&nvidia_acpi_generic_initiator_info);
> > +}
> > +type_init(nvidia_acpi_generic_initiator_register_types);
> > diff --git a/include/hw/acpi/acpi-generic-initiator.h b/include/hw/acpi/acpi-generic-initiator.h
> > index e8e2670309..3e4cf42064 100644
> > --- a/include/hw/acpi/acpi-generic-initiator.h
> > +++ b/include/hw/acpi/acpi-generic-initiator.h
> > @@ -9,10 +9,14 @@
> >  #include "qom/object_interfaces.h"
> >  
> >  #define TYPE_ACPI_GENERIC_INITIATOR "acpi-generic-initiator"
> > +#define TYPE_NVIDIA_ACPI_GENERIC_INITIATOR "nvidia-acpi-generic-initiator"
> >  
> >  #define ACPI_GENERIC_INITIATOR_DEVICE_PROP "device"
> >  #define ACPI_GENERIC_INITIATOR_NODE_PROP "node"
> >  
> > +#define NVIDIA_ACPI_GENERIC_INITIATOR_NODE_START_PROP "numa-node-start"
> > +#define NVIDIA_ACPI_GENERIC_INITIATOR_NODE_COUNT_PROP "numa-node-count"
> > +
> >  typedef struct AcpiGenericInitiator {
> >      /* private */
> >      Object parent;
> > @@ -47,6 +51,14 @@ typedef struct PCIDeviceHandle {
> >      uint64_t res1;
> >  } PCIDeviceHandle;
> >  
> > +typedef struct NvidiaAcpiGenericInitiator {
> > +    AcpiGenericInitiator parent;
> > +} NvidiaAcpiGenericInitiator;
> > +
> > +typedef struct NvidiaAcpiGenericInitiatorClass {
> > +        AcpiGenericInitiatorClass parent_class;
> > +} NvidiaAcpiGenericInitiatorClass;
> > +
> >  void build_srat_generic_initiator(GArray *table_data);
> >  
> >  #endif
> > diff --git a/qapi/qom.json b/qapi/qom.json
> > index 86c87a161c..c29ad1388d 100644
> > --- a/qapi/qom.json
> > +++ b/qapi/qom.json
> > @@ -793,6 +793,24 @@
> >  { 'struct': 'AcpiGenericInitiatorProperties',
> >    'data': { 'device': 'str', 'node': 'uint32' } }
> >  
> > +##
> > +# @NvidiaAcpiGenericInitiatorProperties:
> > +#
> > +# Properties for nvidia-acpi-generic-initiator objects.
> > +#
> > +# @device: the ID of the device to be associated with the nodes
> > +#
> > +# @numa-node-start: the ID of the first numa node
> > +#
> > +# @numa-node-count: count of the numa nodes associated with the device
> > +#
> > +# Since: 8.0
> > +##
> > +{ 'struct': 'NvidiaAcpiGenericInitiatorProperties',
> > +  'data': { 'device': 'str',
> > +            'numa-node-start': 'uint32',
> > +            'numa-node-count': 'uint32' } }
> > +
> >  ##
> >  # @RngProperties:
> >  #
> > @@ -962,7 +980,8 @@
> >      'tls-cipher-suites',
> >      { 'name': 'x-remote-object', 'features': [ 'unstable' ] },
> >      { 'name': 'x-vfio-user-server', 'features': [ 'unstable' ] },
> > -    'acpi-generic-initiator'
> > +    'acpi-generic-initiator',
> > +    'nvidia-acpi-generic-initiator'
> >    ] }
> >  
> >  ##
> > @@ -1030,7 +1049,8 @@
> >        'tls-cipher-suites':          'TlsCredsProperties',
> >        'x-remote-object':            'RemoteObjectProperties',
> >        'x-vfio-user-server':         'VfioUserServerProperties',
> > -      'acpi-generic-initiator':     'AcpiGenericInitiatorProperties'
> > +      'acpi-generic-initiator':     'AcpiGenericInitiatorProperties',
> > +      'nvidia-acpi-generic-initiator':     'NvidiaAcpiGenericInitiatorProperties'
> >    } }
> >  
> >  ##  
>
Markus Armbruster Oct. 13, 2023, 1:17 p.m. UTC | #4
Same comments as for PATCH 1.
Ankit Agrawal Oct. 17, 2023, 2 p.m. UTC | #5
>>         -device vfio-pci-nohotplug,host=0009:01:00.0,bus=pcie.0,addr=04.0,rombar=0,id=dev0 \
>>         -object nvidia-acpi-generic-initiator,id=gi0,device=dev0,numa-node-start=2,numa-node-count=8
>
> Why didn't we just implement start and count in the base object (or a
> list)? It seems like this gives the nvidia-acpi-generic-initiator two
> different ways to set gi->node, either node= of the parent or
> numa-node-start= here.  Once we expose the implicit node count in the
> base object, I'm not sure the purpose of this object.  I would have
> thought it for keying the build of the NVIDIA specific _DSD, but that's
> not implemented in this version.

Agree, allowing a list of nodes to be provided to the acpi-generic-initiator
will remove the need for the nvidia-acpi-generic-initiator object. 

> I also don't see any programmatic means for management tools to know how
> many nodes to create.  For example what happens if there's a MIGv2 that
> supports 16 partitions by default and makes use of the same vfio-pci
> variant driver?  Thanks,

It is supposed to stay at 8 for all the G+H devices. Maybe this can be managed
through proper documentation in the user manual?
Ankit Agrawal Oct. 17, 2023, 2:18 p.m. UTC | #6
>> An admin can provide the range of nodes using numa-node-start and
>> numa-node-count and link it to a device by providing its id. The following
>> sample creates 8 nodes and link them to the device dev0:
>> 
>>         -numa node,nodeid=2 \
>>         -numa node,nodeid=3 \
>>         -numa node,nodeid=4 \
>>         -numa node,nodeid=5 \
>>         -numa node,nodeid=6 \
>>         -numa node,nodeid=7 \
>>         -numa node,nodeid=8 \
>>         -numa node,nodeid=9 \
>>         -device vfio-pci-nohotplug,host=0009:01:00.0,bus=pcie.0,addr=04.0,rombar=0,id=dev0 \
>>         -object nvidia-acpi-generic-initiator,id=gi0,device=dev0,numa-node-start=2,numa-node-count=8 
>
> If you go this way, use an array of references to the numa nodes instead of a start and number.
> There is no obvious reason why they should be contiguous that I can see.

Right.

> I think it is simpler the other way around though - so have the numa nodes point at the
> vfio-pci-nohotplug device. 
>
> Why not something like
> -numa node,gi=dev0 \
> -numa node,gi=dev0 \

That could be one way to do this vs the other suggestion on this thread to provide
a node list through the acpi-generic-initiator object.

Just that QEMU initializes the NUMA nodes before the vfio-pci device, which
could affect our ability to fail in case the user provides a wrong vfio-pci id.

I am fine with both approaches. I will implement the node list option if there
aren't any strong opinions.

>> If you go this way, use an array of references to the numa nodes instead of a start and number.
>> There is no obvious reason why they should be contiguous that I can see.
>
> Yup, I was looking for other places we allow a list syntax, I only
> found fds=a:b:...:z, which is also used for vhostfds=.  I didn't find a
> property beyond the string type to hold this though.

Maybe something like 'node=1-3,8-10'?
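If a string-valued range property like that were chosen, parsing it is
straightforward. A standalone sketch of a hypothetical helper (QEMU itself
would more likely reuse its existing range/bitmap parsing utilities; the
function name and MAX_NODES bound here are assumptions):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

#define MAX_NODES 128

/*
 * Parse a range list such as "1-3,8-10" into a node membership map.
 * Returns the number of distinct nodes set, or -1 on a malformed string.
 */
static int parse_node_ranges(const char *str, bool nodes[MAX_NODES])
{
    const char *p = str;
    int count = 0;

    memset(nodes, 0, sizeof(bool) * MAX_NODES);
    while (*p) {
        char *end;
        long start = strtol(p, &end, 10);
        long stop = start;

        if (end == p || start < 0 || start >= MAX_NODES) {
            return -1;               /* not a number, or out of range */
        }
        if (*end == '-') {           /* "a-b" range form */
            p = end + 1;
            stop = strtol(p, &end, 10);
            if (end == p || stop < start || stop >= MAX_NODES) {
                return -1;
            }
        }
        for (long n = start; n <= stop; n++) {
            if (!nodes[n]) {
                nodes[n] = true;
                count++;
            }
        }
        if (*end == ',') {
            end++;                   /* next comma-separated element */
        } else if (*end != '\0') {
            return -1;               /* trailing garbage */
        }
        p = end;
    }
    return count;
}
```

So "1-3,8-10" would yield six nodes: 1, 2, 3, 8, 9, 10.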

> Right, a uint16List should do.

Sure, thanks.
Alex Williamson Oct. 17, 2023, 3:21 p.m. UTC | #7
On Tue, 17 Oct 2023 14:00:54 +0000
Ankit Agrawal <ankita@nvidia.com> wrote:

> >>         -device vfio-pci-nohotplug,host=0009:01:00.0,bus=pcie.0,addr=04.0,rombar=0,id=dev0 \
> >>         -object nvidia-acpi-generic-initiator,id=gi0,device=dev0,numa-node-start=2,numa-node-count=8  
> >
> > Why didn't we just implement start and count in the base object (or a
> > list)? It seems like this gives the nvidia-acpi-generic-initiator two
> > different ways to set gi->node, either node= of the parent or
> > numa-node-start= here.  Once we expose the implicit node count in the
> > base object, I'm not sure the purpose of this object.  I would have
> > thought it for keying the build of the NVIDIA specific _DSD, but that's
> > not implemented in this version.  
> 
> Agree, allowing a list of nodes to be provided to the acpi-generic-initiator
> will remove the need for the nvidia-acpi-generic-initiator object. 

And what happened to the _DSD?  Is it no longer needed?  Why?

> > I also don't see any programmatic means for management tools to know how
> > many nodes to create.  For example what happens if there's a MIGv2 that
> > supports 16 partitions by default and makes use of the same vfio-pci
> > variant driver?  Thanks,  
> 
> It is supposed to stay at 8 for all the G+H devices. Maybe this can be managed
> through proper documentation in the user manual?

I thought the intention here was that a management tool would
automatically configure the VM with these nodes and GI object in
support of the device.  Planning only for Grace-Hopper isn't looking
too far into the future and it's difficult to make software that can
reference a user manual.  This leads to a higher maintenance burden
where the management tool needs to recognize not only the driver, but
the device bound to the driver and update as new devices are released.
The management tool will never automatically support new devices without
making an assumption about the node configuration.

Do we therefore need some programmatic means for the kernel driver to
expose the node configuration to userspace?  What interfaces would
libvirt like to see here?  Is there an opportunity that this could
begin to define flavors or profiles for variant devices like we have
types for mdev devices where the node configuration would be
encompassed in a device profile?  Thanks,

Alex
Jason Gunthorpe Oct. 17, 2023, 3:28 p.m. UTC | #8
On Tue, Oct 17, 2023 at 09:21:16AM -0600, Alex Williamson wrote:

> Do we therefore need some programmatic means for the kernel driver to
> expose the node configuration to userspace?  What interfaces would
> libvirt like to see here?  Is there an opportunity that this could
> begin to define flavors or profiles for variant devices like we have
> types for mdev devices where the node configuration would be
> encompassed in a device profile? 

I don't think we should shift this mess into the kernel..

We have a wide range of things now that the orchestration must do in
order to prepare that are fairly device specific. I understand in K8S
configurations the preference is using operators (aka user space
drivers) to trigger these things.

Supplying a few extra qemu command line options seems minor compared
to all the profile and provisioning work that has to happen for other
device types.

Jason
Alex Williamson Oct. 17, 2023, 4:54 p.m. UTC | #9
On Tue, 17 Oct 2023 12:28:30 -0300
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Tue, Oct 17, 2023 at 09:21:16AM -0600, Alex Williamson wrote:
> 
> > Do we therefore need some programmatic means for the kernel driver to
> > expose the node configuration to userspace?  What interfaces would
> > libvirt like to see here?  Is there an opportunity that this could
> > begin to define flavors or profiles for variant devices like we have
> > types for mdev devices where the node configuration would be
> > encompassed in a device profile?   
> 
> I don't think we should shift this mess into the kernel..
> 
> We have a wide range of things now that the orchestration must do in
> order to prepare that are fairly device specific. I understand in K8S
> configurations the preference is using operators (aka user space
> drivers) to trigger these things.
> 
> Supplying a few extra qemu command line options seems minor compared
> to all the profile and provisioning work that has to happen for other
> device types.

This seems to be a growing problem for things like mlx5-vfio-pci where
there's non-trivial device configuration necessary to enable migration
support.  It's not super clear to me how those devices are actually
expected to be used in practice with that configuration burden.

Are we simply saying here that it's implicit knowledge that the
orchestration must posses that when assigning devices exactly matching
10de:2342 or 10de:2345 when bound to the nvgrace-gpu-vfio-pci driver
that 8 additional NUMA nodes should be added to the VM and an ACPI
generic initiator object created linking those additional nodes to the
assigned GPU?

Is libvirt ok with that specification or are we simply going to bubble
this up as a user problem? Thanks,

Alex
Jason Gunthorpe Oct. 17, 2023, 5:24 p.m. UTC | #10
On Tue, Oct 17, 2023 at 10:54:19AM -0600, Alex Williamson wrote:
> On Tue, 17 Oct 2023 12:28:30 -0300
> Jason Gunthorpe <jgg@nvidia.com> wrote:
> 
> > On Tue, Oct 17, 2023 at 09:21:16AM -0600, Alex Williamson wrote:
> > 
> > > Do we therefore need some programmatic means for the kernel driver to
> > > expose the node configuration to userspace?  What interfaces would
> > > libvirt like to see here?  Is there an opportunity that this could
> > > begin to define flavors or profiles for variant devices like we have
> > > types for mdev devices where the node configuration would be
> > > encompassed in a device profile?   
> > 
> > I don't think we should shift this mess into the kernel..
> > 
> > We have a wide range of things now that the orchestration must do in
> > order to prepare that are fairly device specific. I understand in K8S
> > configurations the preference is using operators (aka user space
> > drivers) to trigger these things.
> > 
> > Supplying a few extra qemu command line options seems minor compared
> > to all the profile and provisioning work that has to happen for other
> > device types.
> 
> This seems to be a growing problem for things like mlx5-vfio-pci where
> there's non-trivial device configuration necessary to enable migration
> support.  It's not super clear to me how those devices are actually
> expected to be used in practice with that configuration burden.

Yes, it is the nature of the situation. There is lots and lots of
stuff in the background here. We can nibble at some things, but I
don't see a way to be completely free of a userspace driver providing
the orchestration piece for every device type.

Maybe someone who knows more about how the k8s stuff works can explain
more?

> Are we simply saying here that it's implicit knowledge that the
> orchestration must posses that when assigning devices exactly matching
> 10de:2342 or 10de:2345 when bound to the nvgrace-gpu-vfio-pci driver
> that 8 additional NUMA nodes should be added to the VM and an ACPI
> generic initiator object created linking those additional nodes to the
> assigned GPU?

What I'm trying to say is that orchestration should try to pull in a
userspace driver to provide the non-generic pieces.

But, it isn't clear to me what that driver is generically. Something
like this case is pretty standalone, but mlx5 needs to interact with
the networking control plane to fully provision the PCI
function. Storage needs a different control plane. Few PCI devices are
so standalone that they can be provisioned without complicated help.

Even things like IDXD need orchestration to sort of uniquely
understand how to spawn their SIOV functions.

I'm not sure I see a clear vision here from libvirt side how all these
parts interact in the libvirt world, or if the answer is "use
openshift and the operators".

Jason
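(For concreteness, the sample invocation from the commit message can be
scripted as below; the node range, object ids, and BDF are the
illustrative values from the cover letter, not fixed requirements.)

```shell
# Build the eight -numa node options (ids 2..9) and the
# nvidia-acpi-generic-initiator object that ties them to the assigned
# GPU (id dev0), as in the commit message's example.
NUMA_ARGS=""
for id in 2 3 4 5 6 7 8 9; do
    NUMA_ARGS="$NUMA_ARGS -numa node,nodeid=$id"
done
GI_ARGS="-object nvidia-acpi-generic-initiator,id=gi0,device=dev0,numa-node-start=2,numa-node-count=8"
# These strings would be appended to the qemu command line alongside
# the vfio-pci-nohotplug -device option shown in the commit message.
echo "$NUMA_ARGS $GI_ARGS"
```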
diff mbox series

Patch

diff --git a/hw/acpi/acpi-generic-initiator.c b/hw/acpi/acpi-generic-initiator.c
index 1ae79639be..8ef887c3a4 100644
--- a/hw/acpi/acpi-generic-initiator.c
+++ b/hw/acpi/acpi-generic-initiator.c
@@ -150,3 +150,66 @@  void build_srat_generic_initiator(GArray *table_data)
     }
     g_slist_free(list);
 }
+
+static void
+nvidia_acpi_generic_initiator_set_node_start(Object *obj, Visitor *v,
+                                             const char *name, void *opaque,
+                                             Error **errp)
+{
+    AcpiGenericInitiator *gi = ACPI_GENERIC_INITIATOR(obj);
+    uint32_t value;
+
+    if (!visit_type_uint32(v, name, &value, errp)) {
+        return;
+    }
+
+    if (value >= MAX_NODES) {
+        error_setg(errp, "%s: value %u out of range (max %d)",
+                   name, value, MAX_NODES - 1);
+        return;
+    }
+
+    gi->node = value;
+}
+
+static void
+nvidia_acpi_generic_initiator_set_node_count(Object *obj, Visitor *v,
+                                             const char *name, void *opaque,
+                                             Error **errp)
+{
+    AcpiGenericInitiator *gi = ACPI_GENERIC_INITIATOR(obj);
+    uint32_t value;
+
+    if (!visit_type_uint32(v, name, &value, errp)) {
+        return;
+    }
+
+    gi->node_count = value;
+}
+
+static void nvidia_acpi_generic_initiator_class_init(ObjectClass *oc, void *data)
+{
+    object_class_property_add(oc, NVIDIA_ACPI_GENERIC_INITIATOR_NODE_START_PROP,
+                              "uint32", NULL,
+                              nvidia_acpi_generic_initiator_set_node_start,
+                              NULL, NULL);
+    object_class_property_add(oc, NVIDIA_ACPI_GENERIC_INITIATOR_NODE_COUNT_PROP,
+                              "uint32", NULL,
+                              nvidia_acpi_generic_initiator_set_node_count,
+                              NULL, NULL);
+}
+
+static const TypeInfo nvidia_acpi_generic_initiator_info = {
+    .parent = TYPE_ACPI_GENERIC_INITIATOR,
+    .name = TYPE_NVIDIA_ACPI_GENERIC_INITIATOR,
+    .instance_size = sizeof(NvidiaAcpiGenericInitiator),
+    .class_size = sizeof(NvidiaAcpiGenericInitiatorClass),
+    .class_init = nvidia_acpi_generic_initiator_class_init,
+};
+
+static void
+nvidia_acpi_generic_initiator_register_types(void)
+{
+    type_register_static(&nvidia_acpi_generic_initiator_info);
+}
+type_init(nvidia_acpi_generic_initiator_register_types);
diff --git a/include/hw/acpi/acpi-generic-initiator.h b/include/hw/acpi/acpi-generic-initiator.h
index e8e2670309..3e4cf42064 100644
--- a/include/hw/acpi/acpi-generic-initiator.h
+++ b/include/hw/acpi/acpi-generic-initiator.h
@@ -9,10 +9,14 @@ 
 #include "qom/object_interfaces.h"
 
 #define TYPE_ACPI_GENERIC_INITIATOR "acpi-generic-initiator"
+#define TYPE_NVIDIA_ACPI_GENERIC_INITIATOR "nvidia-acpi-generic-initiator"
 
 #define ACPI_GENERIC_INITIATOR_DEVICE_PROP "device"
 #define ACPI_GENERIC_INITIATOR_NODE_PROP "node"
 
+#define NVIDIA_ACPI_GENERIC_INITIATOR_NODE_START_PROP "numa-node-start"
+#define NVIDIA_ACPI_GENERIC_INITIATOR_NODE_COUNT_PROP "numa-node-count"
+
 typedef struct AcpiGenericInitiator {
     /* private */
     Object parent;
@@ -47,6 +51,14 @@  typedef struct PCIDeviceHandle {
     uint64_t res1;
 } PCIDeviceHandle;
 
+typedef struct NvidiaAcpiGenericInitiator {
+    AcpiGenericInitiator parent;
+} NvidiaAcpiGenericInitiator;
+
+typedef struct NvidiaAcpiGenericInitiatorClass {
+        AcpiGenericInitiatorClass parent_class;
+} NvidiaAcpiGenericInitiatorClass;
+
 void build_srat_generic_initiator(GArray *table_data);
 
 #endif
diff --git a/qapi/qom.json b/qapi/qom.json
index 86c87a161c..c29ad1388d 100644
--- a/qapi/qom.json
+++ b/qapi/qom.json
@@ -793,6 +793,24 @@ 
 { 'struct': 'AcpiGenericInitiatorProperties',
   'data': { 'device': 'str', 'node': 'uint32' } }
 
+##
+# @NvidiaAcpiGenericInitiatorProperties:
+#
+# Properties for nvidia-acpi-generic-initiator objects.
+#
+# @device: the ID of the device to be associated with the nodes
+#
+# @numa-node-start: the ID of the first NUMA node in the range
+#
+# @numa-node-count: the number of NUMA nodes associated with the device
+#
+# Since: 8.0
+##
+{ 'struct': 'NvidiaAcpiGenericInitiatorProperties',
+  'data': { 'device': 'str',
+            'numa-node-start': 'uint32',
+            'numa-node-count': 'uint32' } }
+
 ##
 # @RngProperties:
 #
@@ -962,7 +980,8 @@ 
     'tls-cipher-suites',
     { 'name': 'x-remote-object', 'features': [ 'unstable' ] },
     { 'name': 'x-vfio-user-server', 'features': [ 'unstable' ] },
-    'acpi-generic-initiator'
+    'acpi-generic-initiator',
+    'nvidia-acpi-generic-initiator'
   ] }
 
 ##
@@ -1030,7 +1049,8 @@ 
       'tls-cipher-suites':          'TlsCredsProperties',
       'x-remote-object':            'RemoteObjectProperties',
       'x-vfio-user-server':         'VfioUserServerProperties',
-      'acpi-generic-initiator':     'AcpiGenericInitiatorProperties'
+      'acpi-generic-initiator':     'AcpiGenericInitiatorProperties',
+      'nvidia-acpi-generic-initiator': 'NvidiaAcpiGenericInitiatorProperties'
   } }
 
 ##