diff mbox series

[v9,09/11] numa: Extend the CLI to provide memory latency and bandwidth information

Message ID 20190809065731.9097-10-tao3.xu@intel.com
State New
Headers show
Series Build ACPI Heterogeneous Memory Attribute Table (HMAT) | expand

Commit Message

Tao Xu Aug. 9, 2019, 6:57 a.m. UTC
From: Liu Jingqi <jingqi.liu@intel.com>

Add -numa hmat-lb option to provide System Locality Latency and
Bandwidth Information. These memory attributes help to build
System Locality Latency and Bandwidth Information Structure(s)
in ACPI Heterogeneous Memory Attribute Table (HMAT).

Signed-off-by: Liu Jingqi <jingqi.liu@intel.com>
Signed-off-by: Tao Xu <tao3.xu@intel.com>
---

Changes in v9:
    - change the CLI input way, make it more user firendly (Daniel Black)
    use latency=NUM[p|n|u]s and bandwidth=NUM[M|G|P](B/s) as input and drop
    the base-lat and base-bw input.
---
 hw/acpi/hmat.h        |   3 +
 hw/core/numa.c        | 185 ++++++++++++++++++++++++++++++++++++++++++
 include/sysemu/numa.h |   2 +
 qapi/machine.json     |  95 +++++++++++++++++++++-
 qemu-options.hx       |  44 +++++++++-
 5 files changed, 326 insertions(+), 3 deletions(-)

Comments

Daniel Black Aug. 12, 2019, 5:13 a.m. UTC | #1
Tao Xu, Liu Jingqi,

Thanks for doing these updates.

On Fri,  9 Aug 2019 14:57:29 +0800
Tao <tao3.xu@intel.com> wrote:

> From: Liu Jingqi <jingqi.liu@intel.com>
> 
> Add -numa hmat-lb option to provide System Locality Latency and
> Bandwidth Information. These memory attributes help to build
> System Locality Latency and Bandwidth Information Structure(s)
> in ACPI Heterogeneous Memory Attribute Table (HMAT).
> 
> Signed-off-by: Liu Jingqi <jingqi.liu@intel.com>
> Signed-off-by: Tao Xu <tao3.xu@intel.com>
> ---
>  hw/acpi/hmat.h        |   3 +
>  hw/core/numa.c        | 185
> ++++++++++++++++++++++++++++++++++++++++++ include/sysemu/numa.h |
> 2 + qapi/machine.json     |  95 +++++++++++++++++++++-
>  qemu-options.hx       |  44 +++++++++-
>  5 files changed, 326 insertions(+), 3 deletions(-)
> 
> diff --git a/hw/acpi/hmat.h b/hw/acpi/hmat.h
> index 6c32f12e78..b7c1e02cf0 100644
> --- a/hw/acpi/hmat.h
> +++ b/hw/acpi/hmat.h
> @@ -42,6 +42,9 @@
>  
>  #define HMAT_IS_LATENCY(type) (type <= HMAT_LB_DATA_WRITE_LATENCY)
>  
> +#define PICO_PER_USEC 1000000
> +#define PICO_PER_NSEC 1000
> +
>  struct HMAT_LB_Info {
>      /*
>       * Indicates total number of Proximity Domains
> diff --git a/hw/core/numa.c b/hw/core/numa.c
> index cfb6339810..9a494145f3 100644
> --- a/hw/core/numa.c
> +++ b/hw/core/numa.c
> @@ -37,6 +37,7 @@
>  #include "qemu/option.h"
>  #include "qemu/config-file.h"
>  #include "qemu/cutils.h"
> +#include "hw/acpi/hmat.h"
>  
>  QemuOptsList qemu_numa_opts = {
>      .name = "numa",
> @@ -183,6 +184,184 @@ void parse_numa_distance(MachineState *ms,
> NumaDistOptions *dist, Error **errp)
> ms->numa_state->have_numa_distance = true; }
>  
> +void parse_numa_hmat_lb(MachineState *ms, NumaHmatLBOptions *node,
> +                        Error **errp)
> +{
..

Optional; you could support not connected (0xffff) for latency/bandwidth in
this parsing.

> +        if (*endptr == '\0') {
> +            base_lat = 1;
> +        } else if (*(endptr + 1) == 's') {
> +            switch (*endptr) {
> +            case 'p':
> +                base_lat = 1;
> +                break;
> +            case 'n':
> +                base_lat = PICO_PER_NSEC;
> +                break;
> +            case 'u':

Glad you picked up my mismatch of "u/micro".

> +        } else {
> +            error_setg(errp, "Invalid latency unit %s,"
> +                "vaild units are \"ps\" \"ns\" \"us\"",
>node->latency);

typo "valid"

> +        } else if (hmat_lb->base_lat != base_lat) {
> +            error_setg(errp, "Invalid latency unit %s,"
> +                " please unify the units.", node->latency);

This error is misleading. Should be something like "all latencies must be
specified in the same units"

> +        switch (toupper(*endptr)) {
> +        case '\0':
> +        case 'M':
> +            base_bw = 1;
> +            break;
> +        case 'G':
> +            base_bw = UINT64_C(1) << 10;
> +            break;

There was one more gap - Terra.

        case 'T':
           base_bw = UINT64_C(1) << 20;
           break;

> +        case 'P':
> +            base_bw = UINT64_C(1) << 20;
and:
               base_bw = UINT64_C(1) << 30;

> +            break;
> +        }


Currently Linux 5.3.0-rc3+ doesn't cope with real corrected "bandwidth=2P" so
maybe not worth it.

[    2.092060] HMAT: Locality: Flags:00 Type:Access Bandwidth Initiator
Domains:1 Target Domains:2 Base:1073741824 [    2.092326]   Initiator-Target[0-0]:-2147483648 MB/s

On values, testing for overflow is required. e.g:

 -numa hmat-lb,initiator=0,target=0,hierarchy=memory,data-type=access-bandwidth,bandwidth=4096T
bandwidth=4096T

[    2.047676] HMAT: Locality: Flags:00 Type:Access Bandwidth Initiator Domains:1 Target Domains:2 Base:1048576
[    2.048084]   Initiator-Target[0-0]:0 MB/s

Technically ACPI could support up to 4P with base/offset but you'd need to be a
lot trickier (i.e. base is highest common multiple of all entries and then see
if entry/base > 2^32-2 ) with base/entry values to arrive at this number.

+docs/commit message propagation of this.


> +        } else if (hmat_lb->base_lat != base_lat) {

Bug: Incorrectly copied - base_lat should be base_bw (twice)

> +            error_setg(errp, "Invalid bandwidth unit %s,"
> +                " please unify the units.", node->bandwidth);

This error is misleading. Should be something like "all bandwidths must be
specified in the same units"

> diff --git a/qemu-options.hx b/qemu-options.hx
> index c480781992..cda4607f3a 100644
> --- a/qemu-options.hx
> +++ b/qemu-options.hx

> +@example
> +-m 2G \
> +-object memory-backend-ram,size=1024M,policy=bind,host-nodes=0,id=ram-node0 -numa node,nodeid=0,memdev=ram-node0 \
> +-object memory-backend-ram,size=1024M,policy=bind,host-nodes=1,id=ram-node1 -numa node,nodeid=1,memdev=ram-node1 \
> +-smp 2 \
> +-numa cpu,node-id=0,socket-id=0 \
> +-numa cpu,node-id=0,socket-id=1 \
> +-numa hmat-lb,initiator=0,target=0,hierarchy=memory,data-type=access-latency,latency=5ns \
> +-numa hmat-lb,initiator=0,target=0,hierarchy=memory,data-type=access-bandwidth,bandwidth=200M \
> +-numa hmat-lb,initiator=0,target=1,hierarchy=memory,data-type=access-latency,latency=10ns \
> +-numa hmat-lb,initiator=0,target=1,hierarchy=memory,data-type=access-bandwidth,bandwidth=100M \
> +@end example

nit: remove slash on last line

Is this a valid example? I get

qemu-system-x86_64: -numa hmat-lb,initiator=0,target=1,hierarchy=memory,data-type=access-latency,latency=11us: Invalid target=1, it hasn't a valid initiator proximity domain.

(I tested with host-nodes=1 changed to 0 as local machine is single node)

Technically on [PATCH v9 07/11]
diff --git a/hw/acpi/hmat.c b/hw/acpi/hmat.c
index abf99b1adc..431818dc82 100644
--- a/hw/acpi/hmat.c
+++ b/hw/acpi/hmat.c
@@ -67,11 +67,81 @@ static void build_hmat_mpda(GArray *table_data, uint16_t flags, int initiator,
     build_append_int_noprefix(table_data, 0, 8);
 }
 
+/*
+ * ACPI 6.3: 5.2.27.4 System Locality Latency and Bandwidth Information
+ * Structure: Table 5-142

nit: 5-146

Test as follows:

qemu-system-x86_64   -kernel /home/dan/repos/linux/vmlinux   -nographic -append  console=ttyS0  \
   -m 2G -object memory-backend-ram,size=1024M,policy=bind,host-nodes=0,id=ram-node0 \
   -numa node,nodeid=0,memdev=ram-node0 \
   -object memory-backend-ram,size=1024M,policy=bind,host-nodes=0,id=ram-node1 \
   -numa node,nodeid=1,memdev=ram-node1 -smp 2 -numa cpu,node-id=0,socket-id=0 \
   -numa cpu,node-id=0,socket-id=1  \
   -numa hmat-lb,initiator=0,target=0,hierarchy=memory,data-type=access-latency,latency=123us \
   -numa hmat-lb,initiator=0,target=0,hierarchy=memory,data-type=access-bandwidth,bandwidth=200M \
   -numa hmat-cache,node-id=0,size=0x20000,total=1,level=1,assoc=direct,policy=write-back,line=8 \
   -numa hmat-cache,node-id=1,size=0x20000,total=1,level=1,assoc=direct,policy=write-back,line=8 \
| grep -A 5 HMAT


[    0.038912] ACPI: HMAT 0x000000007FFE16C5 000118 (v02 BOCHS  BXPCHMAT 00000001 BXPC 00000001)
[    0.040954] SRAT: PXM 0 -> APIC 0x00 -> Node 0
[    0.040999] SRAT: PXM 0 -> APIC 0x01 -> Node 0
[    0.041189] ACPI: SRAT: Node 0 PXM 0 [mem 0x00000000-0x0009ffff]
[    0.041250] ACPI: SRAT: Node 0 PXM 0 [mem 0x00100000-0x3fffffff]
[    0.041276] ACPI: SRAT: Node 1 PXM 1 [mem 0x40000000-0x7fffffff]
--
[    1.984572] HMAT: Memory Flags:0001 Processor Domain:0 Memory Domain:0
[    1.984792] HMAT: Memory Flags:0000 Processor Domain:0 Memory Domain:1
[    1.985435] HMAT: Locality: Flags:00 Type:Access Latency Initiator Domains:1 Target Domains:2 Base:1000000
[    1.986424]   Initiator-Target[0-0]:123000 nsec
[    1.986664]   Initiator-Target[0-1]:0 nsec
[    1.986910] HMAT: Locality: Flags:00 Type:Access Bandwidth Initiator Domains:1 Target Domains:2 Base:1
[    1.987229]   Initiator-Target[0-0]:200 MB/s
[    1.987356]   Initiator-Target[0-1]:0 MB/s
[    1.987549] HMAT: Cache: Domain:0 Size:131072 Attrs:00081111 SMBIOS Handles:0
[    1.988393] HMAT: Cache: Domain:1 Size:131072 Attrs:00081111 SMBIOS Handles:0

Leaving default latency/bw as 0 if ok as spec says '0: the corresponding latency
or bandwidth information is not provided.'. Potentially the kernel could display this better.

Also note https://marc.info/?l=linux-acpi&m=156506549410279&w=2 submitted as
hmat_build_table_structs only calls build_hmat_mpda with flags=0 or HMAT_PROX_INIT_VALID (0x1) which is right looking at ACPI-6.3. An Ack/(Nack if I'm wrong) there would be good to have both kernel and this patch series working together.

for entire series:

Reviewed-by: Daniel Black <daniel@linux.ibm.com>
Tao Xu Aug. 12, 2019, 6:11 a.m. UTC | #2
On 8/12/2019 1:13 PM, Daniel Black wrote:
> 
> 
> Tao Xu, Liu Jingqi,
> 
> Thanks for doing these updates.
> 
> On Fri,  9 Aug 2019 14:57:29 +0800
> Tao <tao3.xu@intel.com> wrote:
> 
>> From: Liu Jingqi <jingqi.liu@intel.com>
>>
>> Add -numa hmat-lb option to provide System Locality Latency and
>> Bandwidth Information. These memory attributes help to build
>> System Locality Latency and Bandwidth Information Structure(s)
>> in ACPI Heterogeneous Memory Attribute Table (HMAT).
>>
>> Signed-off-by: Liu Jingqi <jingqi.liu@intel.com>
>> Signed-off-by: Tao Xu <tao3.xu@intel.com>
>> ---
>>   hw/acpi/hmat.h        |   3 +
>>   hw/core/numa.c        | 185
>> ++++++++++++++++++++++++++++++++++++++++++ include/sysemu/numa.h |
>> 2 + qapi/machine.json     |  95 +++++++++++++++++++++-
>>   qemu-options.hx       |  44 +++++++++-
>>   5 files changed, 326 insertions(+), 3 deletions(-)
>>
>> diff --git a/hw/acpi/hmat.h b/hw/acpi/hmat.h
>> index 6c32f12e78..b7c1e02cf0 100644
>> --- a/hw/acpi/hmat.h
>> +++ b/hw/acpi/hmat.h
>> @@ -42,6 +42,9 @@
>>   
>>   #define HMAT_IS_LATENCY(type) (type <= HMAT_LB_DATA_WRITE_LATENCY)
>>   
>> +#define PICO_PER_USEC 1000000
>> +#define PICO_PER_NSEC 1000
>> +
>>   struct HMAT_LB_Info {
>>       /*
>>        * Indicates total number of Proximity Domains
>> diff --git a/hw/core/numa.c b/hw/core/numa.c
>> index cfb6339810..9a494145f3 100644
>> --- a/hw/core/numa.c
>> +++ b/hw/core/numa.c
>> @@ -37,6 +37,7 @@
>>   #include "qemu/option.h"
>>   #include "qemu/config-file.h"
>>   #include "qemu/cutils.h"
>> +#include "hw/acpi/hmat.h"
>>   
>>   QemuOptsList qemu_numa_opts = {
>>       .name = "numa",
>> @@ -183,6 +184,184 @@ void parse_numa_distance(MachineState *ms,
>> NumaDistOptions *dist, Error **errp)
>> ms->numa_state->have_numa_distance = true; }
>>   
>> +void parse_numa_hmat_lb(MachineState *ms, NumaHmatLBOptions *node,
>> +                        Error **errp)
>> +{
> ..
> 
> Optional; you could support not connected (0xffff) for latency/bandwidth in
> this parsing.
> 
>> +        if (*endptr == '\0') {
>> +            base_lat = 1;
>> +        } else if (*(endptr + 1) == 's') {
>> +            switch (*endptr) {
>> +            case 'p':
>> +                base_lat = 1;
>> +                break;
>> +            case 'n':
>> +                base_lat = PICO_PER_NSEC;
>> +                break;
>> +            case 'u':
> 
> Glad you picked up my mismatch of "u/micro".
> 
>> +        } else {
>> +            error_setg(errp, "Invalid latency unit %s,"
>> +                "vaild units are \"ps\" \"ns\" \"us\"",
>> node->latency);
> 
> typo "valid"
> 
>> +        } else if (hmat_lb->base_lat != base_lat) {
>> +            error_setg(errp, "Invalid latency unit %s,"
>> +                " please unify the units.", node->latency);
> 
> This error is misleading. Should be something like "all latencies must be
> specified in the same units"
> 
>> +        switch (toupper(*endptr)) {
>> +        case '\0':
>> +        case 'M':
>> +            base_bw = 1;
>> +            break;
>> +        case 'G':
>> +            base_bw = UINT64_C(1) << 10;
>> +            break;
> 
> There was one more gap - Terra.
> 
>          case 'T':
>             base_bw = UINT64_C(1) << 20;
>             break;
> 
Oh, my mistake, I should use TB/s instead of PB/s.
Thank you for pointing out this.

>> +        case 'P':
>> +            base_bw = UINT64_C(1) << 20;
> and:
>                 base_bw = UINT64_C(1) << 30;
> 
>> +            break;
>> +        }
> 
> 
> Currently Linux 5.3.0-rc3+ doesn't cope with real corrected "bandwidth=2P" so
> maybe not worth it. >
> [    2.092060] HMAT: Locality: Flags:00 Type:Access Bandwidth Initiator
> Domains:1 Target Domains:2 Base:1073741824 [    2.092326]   Initiator-Target[0-0]:-2147483648 MB/s
> 
> On values, testing for overflow is required. e.g:
> 
>   -numa hmat-lb,initiator=0,target=0,hierarchy=memory,data-type=access-bandwidth,bandwidth=4096T
> bandwidth=4096T
> 
> [    2.047676] HMAT: Locality: Flags:00 Type:Access Bandwidth Initiator Domains:1 Target Domains:2 Base:1048576
> [    2.048084]   Initiator-Target[0-0]:0 MB/s
> 
> Technically ACPI could support up to 4P with base/offset but you'd need to be a
> lot trickier (i.e. base is highest common multiple of all entries and then see
> if entry/base > 2^32-2 ) with base/entry values to arrive at this number.
> 
> +docs/commit message propagation of this.
>

I agree. I also test the overflow case. Thank you for your suggestion. I 
will add a docs for it.
> 
>> +        } else if (hmat_lb->base_lat != base_lat) {
> 
> Bug: Incorrectly copied - base_lat should be base_bw (twice)
> 

My mistake, I will correct it.
>> +            error_setg(errp, "Invalid bandwidth unit %s,"
>> +                " please unify the units.", node->bandwidth);
> 
> This error is misleading. Should be something like "all bandwidths must be
> specified in the same units"
> 
>> diff --git a/qemu-options.hx b/qemu-options.hx
>> index c480781992..cda4607f3a 100644
>> --- a/qemu-options.hx
>> +++ b/qemu-options.hx
> 
>> +@example
>> +-m 2G \
>> +-object memory-backend-ram,size=1024M,policy=bind,host-nodes=0,id=ram-node0 -numa node,nodeid=0,memdev=ram-node0 \
>> +-object memory-backend-ram,size=1024M,policy=bind,host-nodes=1,id=ram-node1 -numa node,nodeid=1,memdev=ram-node1 \
>> +-smp 2 \
>> +-numa cpu,node-id=0,socket-id=0 \
>> +-numa cpu,node-id=0,socket-id=1 \
>> +-numa hmat-lb,initiator=0,target=0,hierarchy=memory,data-type=access-latency,latency=5ns \
>> +-numa hmat-lb,initiator=0,target=0,hierarchy=memory,data-type=access-bandwidth,bandwidth=200M \
>> +-numa hmat-lb,initiator=0,target=1,hierarchy=memory,data-type=access-latency,latency=10ns \
>> +-numa hmat-lb,initiator=0,target=1,hierarchy=memory,data-type=access-bandwidth,bandwidth=100M \
>> +@end example
> 
> nit: remove slash on last line
> 
> Is this a valid example? I get
> 
> qemu-system-x86_64: -numa hmat-lb,initiator=0,target=1,hierarchy=memory,data-type=access-latency,latency=11us: Invalid target=1, it hasn't a valid initiator proximity domain.
> 
> (I tested with host-nodes=1 changed to 0 as local machine is single node)
> 

I forget to update the example. It should add as follow:

   -numa node,nodeid=0,memdev=ram-node0 \
   -numa node,nodeid=1,memdev=ram-node1,initiator=1 \

> Technically on [PATCH v9 07/11]
> diff --git a/hw/acpi/hmat.c b/hw/acpi/hmat.c
> index abf99b1adc..431818dc82 100644
> --- a/hw/acpi/hmat.c
> +++ b/hw/acpi/hmat.c
> @@ -67,11 +67,81 @@ static void build_hmat_mpda(GArray *table_data, uint16_t flags, int initiator,
>       build_append_int_noprefix(table_data, 0, 8);
>   }
>   
> +/*
> + * ACPI 6.3: 5.2.27.4 System Locality Latency and Bandwidth Information
> + * Structure: Table 5-142
> 
> nit: 5-146
> 
> Test as follows:
> 
> qemu-system-x86_64   -kernel /home/dan/repos/linux/vmlinux   -nographic -append  console=ttyS0  \
>     -m 2G -object memory-backend-ram,size=1024M,policy=bind,host-nodes=0,id=ram-node0 \
>     -numa node,nodeid=0,memdev=ram-node0 \
>     -object memory-backend-ram,size=1024M,policy=bind,host-nodes=0,id=ram-node1 \
>     -numa node,nodeid=1,memdev=ram-node1 -smp 2 -numa cpu,node-id=0,socket-id=0 \
>     -numa cpu,node-id=0,socket-id=1  \
>     -numa hmat-lb,initiator=0,target=0,hierarchy=memory,data-type=access-latency,latency=123us \
>     -numa hmat-lb,initiator=0,target=0,hierarchy=memory,data-type=access-bandwidth,bandwidth=200M \
>     -numa hmat-cache,node-id=0,size=0x20000,total=1,level=1,assoc=direct,policy=write-back,line=8 \
>     -numa hmat-cache,node-id=1,size=0x20000,total=1,level=1,assoc=direct,policy=write-back,line=8 \
> | grep -A 5 HMAT
> 
> 
> [    0.038912] ACPI: HMAT 0x000000007FFE16C5 000118 (v02 BOCHS  BXPCHMAT 00000001 BXPC 00000001)
> [    0.040954] SRAT: PXM 0 -> APIC 0x00 -> Node 0
> [    0.040999] SRAT: PXM 0 -> APIC 0x01 -> Node 0
> [    0.041189] ACPI: SRAT: Node 0 PXM 0 [mem 0x00000000-0x0009ffff]
> [    0.041250] ACPI: SRAT: Node 0 PXM 0 [mem 0x00100000-0x3fffffff]
> [    0.041276] ACPI: SRAT: Node 1 PXM 1 [mem 0x40000000-0x7fffffff]
> --
> [    1.984572] HMAT: Memory Flags:0001 Processor Domain:0 Memory Domain:0
> [    1.984792] HMAT: Memory Flags:0000 Processor Domain:0 Memory Domain:1
> [    1.985435] HMAT: Locality: Flags:00 Type:Access Latency Initiator Domains:1 Target Domains:2 Base:1000000
> [    1.986424]   Initiator-Target[0-0]:123000 nsec
> [    1.986664]   Initiator-Target[0-1]:0 nsec
> [    1.986910] HMAT: Locality: Flags:00 Type:Access Bandwidth Initiator Domains:1 Target Domains:2 Base:1
> [    1.987229]   Initiator-Target[0-0]:200 MB/s
> [    1.987356]   Initiator-Target[0-1]:0 MB/s
> [    1.987549] HMAT: Cache: Domain:0 Size:131072 Attrs:00081111 SMBIOS Handles:0
> [    1.988393] HMAT: Cache: Domain:1 Size:131072 Attrs:00081111 SMBIOS Handles:0
> 
> Leaving default latency/bw as 0 if ok as spec says '0: the corresponding latency
> or bandwidth information is not provided.'. Potentially the kernel could display this better.

Yes, we set default as 0, because spec said. But the kernel code has no 
warning just show 0. But in qemu-options.hx we note this:

if NUM is 0, means the corresponding latency or bandwidth information is 
not provided.

> 
> Also note https://marc.info/?l=linux-acpi&m=156506549410279&w=2 submitted as
> hmat_build_table_structs only calls build_hmat_mpda with flags=0 or HMAT_PROX_INIT_VALID (0x1) which is right looking at ACPI-6.3. An Ack/(Nack if I'm wrong) there would be good to have both kernel and this patch series working together.

Sounds good!
> 
> for entire series:
> 
> Reviewed-by: Daniel Black <daniel@linux.ibm.com>
>
Eric Blake Aug. 13, 2019, 3:11 p.m. UTC | #3
On 8/9/19 1:57 AM, Tao wrote:
> From: Liu Jingqi <jingqi.liu@intel.com>
> 
> Add -numa hmat-lb option to provide System Locality Latency and
> Bandwidth Information. These memory attributes help to build
> System Locality Latency and Bandwidth Information Structure(s)
> in ACPI Heterogeneous Memory Attribute Table (HMAT).
> 
> Signed-off-by: Liu Jingqi <jingqi.liu@intel.com>
> Signed-off-by: Tao Xu <tao3.xu@intel.com>
> ---
> 
> Changes in v9:
>     - change the CLI input way, make it more user firendly (Daniel Black)
>     use latency=NUM[p|n|u]s and bandwidth=NUM[M|G|P](B/s) as input and drop
>     the base-lat and base-bw input.

Why are you hand-rolling yet another scaling parser instead of reusing
one that's already in-tree?

> +++ b/hw/core/numa.c

> +void parse_numa_hmat_lb(MachineState *ms, NumaHmatLBOptions *node,
> +                        Error **errp)
> +{

> +    if (node->has_latency) {
> +        hmat_lb = ms->numa_state->hmat_lb[node->hierarchy][node->data_type];
> +
> +        if (!hmat_lb) {
> +            hmat_lb = g_malloc0(sizeof(*hmat_lb));
> +            ms->numa_state->hmat_lb[node->hierarchy][node->data_type] = hmat_lb;
> +        } else if (hmat_lb->latency[node->initiator][node->target]) {
> +            error_setg(errp, "Duplicate configuration of the latency for "
> +                       "initiator=%" PRIu16 " and target=%" PRIu16 ".",
> +                       node->initiator, node->target);
> +            return;
> +        }
> +
> +        ret = qemu_strtoui(node->latency, &endptr, 10, &latency);
> +        if (ret < 0) {
> +            error_setg(errp, "Invalid latency %s", node->latency);
> +            return;
> +        }
> +
> +        if (*endptr == '\0') {
> +            base_lat = 1;
> +        } else if (*(endptr + 1) == 's') {
> +            switch (*endptr) {
> +            case 'p':
> +                base_lat = 1;
> +                break;
> +            case 'n':
> +                base_lat = PICO_PER_NSEC;
> +                break;
> +            case 'u':
> +                base_lat = PICO_PER_USEC;
> +                break;

Hmm - this is a different scaling than any of our existing parsers
(which assume multiples k/M/G..., not subdivisions u/n/s)


> +    if (node->has_bandwidth) {
> +        hmat_lb = ms->numa_state->hmat_lb[node->hierarchy][node->data_type];
> +
> +        if (!hmat_lb) {
> +            hmat_lb = g_malloc0(sizeof(*hmat_lb));
> +            ms->numa_state->hmat_lb[node->hierarchy][node->data_type] = hmat_lb;
> +        } else if (hmat_lb->bandwidth[node->initiator][node->target]) {
> +            error_setg(errp, "Duplicate configuration of the bandwidth for "
> +                       "initiator=%" PRIu16 " and target=%" PRIu16 ".",
> +                       node->initiator, node->target);
> +            return;
> +        }
> +
> +        ret = qemu_strtoui(node->bandwidth, &endptr, 10, &bandwidth);
> +        if (ret < 0) {
> +            error_setg(errp, "Invalid bandwidth %s", node->bandwidth);
> +            return;
> +        }
> +
> +        switch (toupper(*endptr)) {
> +        case '\0':
> +        case 'M':
> +            base_bw = 1;
> +            break;
> +        case 'G':
> +            base_bw = UINT64_C(1) << 10;
> +            break;
> +        case 'P':
> +            base_bw = UINT64_C(1) << 20;
> +            break;

But this one, in addition to being wrong (P is 1<<30, not 1<<20), should
definitely be reusing qemu_strtosz_metric() or similar (look in
util/cutils.c).


> +++ b/qapi/machine.json
> @@ -377,10 +377,12 @@
>  #
>  # @cpu: property based CPU(s) to node mapping (Since: 2.10)
>  #
> +# @hmat-lb: memory latency and bandwidth information (Since: 4.2)
> +#
>  # Since: 2.1
>  ##
>  { 'enum': 'NumaOptionsType',
> -  'data': [ 'node', 'dist', 'cpu' ] }
> +  'data': [ 'node', 'dist', 'cpu', 'hmat-lb' ] }
>  

> +##
> +# @HmatLBDataType:
> +#
> +# Data type in the System Locality Latency
> +# and Bandwidth Information Structure of HMAT (Heterogeneous
> +# Memory Attribute Table)
> +#
> +# For more information of @HmatLBDataType see
> +# the chapter 5.2.27.4: Table 5-142:  Field "Data Type" of ACPI 6.3 spec.
> +#
> +# @access-latency: access latency (picoseconds)
> +#
> +# @read-latency: read latency (picoseconds)
> +#
> +# @write-latency: write latency (picoseconds)
> +#
> +# @access-bandwidth: access bandwidth (MB/s)
> +#
> +# @read-bandwidth: read bandwidth (MB/s)
> +#
> +# @write-bandwidth: write bandwidth (MB/s)

Are these really the best scales?


> +
> +##
> +# @NumaHmatLBOptions:
> +#
> +# Set the system locality latency and bandwidth information
> +# between Initiator and Target proximity Domains.
> +#
> +# For more information of @NumaHmatLBOptions see
> +# the chapter 5.2.27.4: Table 5-142 of ACPI 6.3 spec.
> +#
> +# @initiator: the Initiator Proximity Domain.
> +#
> +# @target: the Target Proximity Domain.
> +#
> +# @hierarchy: the Memory Hierarchy. Indicates the performance
> +#             of memory or side cache.
> +#
> +# @data-type: presents the type of data, access/read/write
> +#             latency or hit latency.
> +#
> +# @latency: the value of latency from @initiator to @target proximity domain,
> +#           the latency units are "ps(picosecond)", "ns(nanosecond)" or
> +#           "us(microsecond)".
> +#
> +# @bandwidth: the value of bandwidth between @initiator and @target proximity
> +#             domain, the bandwidth units are "MB(/s)","GB(/s)" or "PB(/s)".
> +#
> +# Since: 4.2
> +##
> +{ 'struct': 'NumaHmatLBOptions',
> +    'data': {
> +    'initiator': 'uint16',
> +    'target': 'uint16',
> +    'hierarchy': 'HmatLBMemoryHierarchy',
> +    'data-type': 'HmatLBDataType',
> +    '*latency': 'str',
> +    '*bandwidth': 'str' }}

...and then parsing strings instead of taking raw integers?  Parsing
strings is okay for HMP, but for QMP, our goal should be a single
representation with no additional sugar on top.  Latency and bandwidth
should be int in a single scale.


> +++ b/qemu-options.hx
> @@ -164,16 +164,19 @@ DEF("numa", HAS_ARG, QEMU_OPTION_numa,
>      "-numa node[,mem=size][,cpus=firstcpu[-lastcpu]][,nodeid=node][,initiator=node]\n"
>      "-numa node[,memdev=id][,cpus=firstcpu[-lastcpu]][,nodeid=node][,initiator=node]\n"
>      "-numa dist,src=source,dst=destination,val=distance\n"
> -    "-numa cpu,node-id=node[,socket-id=x][,core-id=y][,thread-id=z]\n",
> +    "-numa cpu,node-id=node[,socket-id=x][,core-id=y][,thread-id=z]\n"
> +    "-numa hmat-lb,initiator=node,target=node,hierarchy=memory|first-level|second-level|third-level,data-type=access-latency|read-latency|write-latency[,latency=lat][,bandwidth=bw]\n",

Command-line parsing can then take human-written scaled numbers, and
pre-convert them into the single scale accepted by the QMP interface.
Tao Xu Aug. 14, 2019, 2:58 a.m. UTC | #4
On 8/13/2019 11:11 PM, Eric Blake wrote:
> On 8/9/19 1:57 AM, Tao wrote:
>> From: Liu Jingqi <jingqi.liu@intel.com>
>>
>> Add -numa hmat-lb option to provide System Locality Latency and
>> Bandwidth Information. These memory attributes help to build
>> System Locality Latency and Bandwidth Information Structure(s)
>> in ACPI Heterogeneous Memory Attribute Table (HMAT).
>>
>> Signed-off-by: Liu Jingqi <jingqi.liu@intel.com>
>> Signed-off-by: Tao Xu <tao3.xu@intel.com>
>> ---
>>
>> Changes in v9:
>>      - change the CLI input way, make it more user firendly (Daniel Black)
>>      use latency=NUM[p|n|u]s and bandwidth=NUM[M|G|P](B/s) as input and drop
>>      the base-lat and base-bw input.
> 
> Why are you hand-rolling yet another scaling parser instead of reusing
> one that's already in-tree?

Because there are no time scaling parser and QMP 'size' type will use kb 
as default. It is a tricky issue because the entry in HMAT is small(max 
0xffff) and we need to store the unit in HMAT.

But as you mentioned blew, 'str' is not a good choice for QMP.
Therefore, what about this solution:

For bandwidth, reuse the qemu_strtosz_MiB() (because the smllest unit is 
MB/s). For latency, write a time scaling parser named as 
"qemu_strtotime_ps()" and "qemu_strtotime_ns()" in util/cutils.c. And 
then use it to pre-convert them into the single scale (QMP interface can 
use).

At last, in HMAT, we auto store the data, separate it into the same base 
unit and entry, and show error if overflow. Then the HMAT can support as 
large as possible.

I am wondering if this solution is OK.
> 
>> +++ b/hw/core/numa.c
> 
>> +void parse_numa_hmat_lb(MachineState *ms, NumaHmatLBOptions *node,
>> +                        Error **errp)
>> +{
> 
>> +    if (node->has_latency) {
>> +        hmat_lb = ms->numa_state->hmat_lb[node->hierarchy][node->data_type];
>> +
>> +        if (!hmat_lb) {
>> +            hmat_lb = g_malloc0(sizeof(*hmat_lb));
>> +            ms->numa_state->hmat_lb[node->hierarchy][node->data_type] = hmat_lb;
>> +        } else if (hmat_lb->latency[node->initiator][node->target]) {
>> +            error_setg(errp, "Duplicate configuration of the latency for "
>> +                       "initiator=%" PRIu16 " and target=%" PRIu16 ".",
>> +                       node->initiator, node->target);
>> +            return;
>> +        }
>> +
>> +        ret = qemu_strtoui(node->latency, &endptr, 10, &latency);
>> +        if (ret < 0) {
>> +            error_setg(errp, "Invalid latency %s", node->latency);
>> +            return;
>> +        }
>> +
>> +        if (*endptr == '\0') {
>> +            base_lat = 1;
>> +        } else if (*(endptr + 1) == 's') {
>> +            switch (*endptr) {
>> +            case 'p':
>> +                base_lat = 1;
>> +                break;
>> +            case 'n':
>> +                base_lat = PICO_PER_NSEC;
>> +                break;
>> +            case 'u':
>> +                base_lat = PICO_PER_USEC;
>> +                break;
> 
> Hmm - this is a different scaling than any of our existing parsers
> (which assume multiples k/M/G..., not subdivisions u/n/s)
> 
> 
>> +    if (node->has_bandwidth) {
>> +        hmat_lb = ms->numa_state->hmat_lb[node->hierarchy][node->data_type];
>> +
>> +        if (!hmat_lb) {
>> +            hmat_lb = g_malloc0(sizeof(*hmat_lb));
>> +            ms->numa_state->hmat_lb[node->hierarchy][node->data_type] = hmat_lb;
>> +        } else if (hmat_lb->bandwidth[node->initiator][node->target]) {
>> +            error_setg(errp, "Duplicate configuration of the bandwidth for "
>> +                       "initiator=%" PRIu16 " and target=%" PRIu16 ".",
>> +                       node->initiator, node->target);
>> +            return;
>> +        }
>> +
>> +        ret = qemu_strtoui(node->bandwidth, &endptr, 10, &bandwidth);
>> +        if (ret < 0) {
>> +            error_setg(errp, "Invalid bandwidth %s", node->bandwidth);
>> +            return;
>> +        }
>> +
>> +        switch (toupper(*endptr)) {
>> +        case '\0':
>> +        case 'M':
>> +            base_bw = 1;
>> +            break;
>> +        case 'G':
>> +            base_bw = UINT64_C(1) << 10;
>> +            break;
>> +        case 'P':
>> +            base_bw = UINT64_C(1) << 20;
>> +            break;
> 
> But this one, in addition to being wrong (P is 1<<30, not 1<<20), should
> definitely be reusing qemu_strtosz_metric() or similar (look in
> util/cutils.c).
> 
> 
>> +++ b/qapi/machine.json
>> @@ -377,10 +377,12 @@
>>   #
>>   # @cpu: property based CPU(s) to node mapping (Since: 2.10)
>>   #
>> +# @hmat-lb: memory latency and bandwidth information (Since: 4.2)
>> +#
>>   # Since: 2.1
>>   ##
>>   { 'enum': 'NumaOptionsType',
>> -  'data': [ 'node', 'dist', 'cpu' ] }
>> +  'data': [ 'node', 'dist', 'cpu', 'hmat-lb' ] }
>>   
> 
>> +##
>> +# @HmatLBDataType:
>> +#
>> +# Data type in the System Locality Latency
>> +# and Bandwidth Information Structure of HMAT (Heterogeneous
>> +# Memory Attribute Table)
>> +#
>> +# For more information of @HmatLBDataType see
>> +# the chapter 5.2.27.4: Table 5-142:  Field "Data Type" of ACPI 6.3 spec.
>> +#
>> +# @access-latency: access latency (picoseconds)
>> +#
>> +# @read-latency: read latency (picoseconds)
>> +#
>> +# @write-latency: write latency (picoseconds)
>> +#
>> +# @access-bandwidth: access bandwidth (MB/s)
>> +#
>> +# @read-bandwidth: read bandwidth (MB/s)
>> +#
>> +# @write-bandwidth: write bandwidth (MB/s)
> 
> Are these really the best scales?
> 
> 
>> +
>> +##
>> +# @NumaHmatLBOptions:
>> +#
>> +# Set the system locality latency and bandwidth information
>> +# between Initiator and Target proximity Domains.
>> +#
>> +# For more information of @NumaHmatLBOptions see
>> +# the chapter 5.2.27.4: Table 5-142 of ACPI 6.3 spec.
>> +#
>> +# @initiator: the Initiator Proximity Domain.
>> +#
>> +# @target: the Target Proximity Domain.
>> +#
>> +# @hierarchy: the Memory Hierarchy. Indicates the performance
>> +#             of memory or side cache.
>> +#
>> +# @data-type: presents the type of data, access/read/write
>> +#             latency or hit latency.
>> +#
>> +# @latency: the value of latency from @initiator to @target proximity domain,
>> +#           the latency units are "ps(picosecond)", "ns(nanosecond)" or
>> +#           "us(microsecond)".
>> +#
>> +# @bandwidth: the value of bandwidth between @initiator and @target proximity
>> +#             domain, the bandwidth units are "MB(/s)","GB(/s)" or "PB(/s)".
>> +#
>> +# Since: 4.2
>> +##
>> +{ 'struct': 'NumaHmatLBOptions',
>> +    'data': {
>> +    'initiator': 'uint16',
>> +    'target': 'uint16',
>> +    'hierarchy': 'HmatLBMemoryHierarchy',
>> +    'data-type': 'HmatLBDataType',
>> +    '*latency': 'str',
>> +    '*bandwidth': 'str' }}
> 
> ...and then parsing strings instead of taking raw integers?  Parsing
> strings is okay for HMP, but for QMP, our goal should be a single
> representation with no additional sugar on top.  Latency and bandwidth
> should be int in a single scale.
> 
> 
>> +++ b/qemu-options.hx
>> @@ -164,16 +164,19 @@ DEF("numa", HAS_ARG, QEMU_OPTION_numa,
>>       "-numa node[,mem=size][,cpus=firstcpu[-lastcpu]][,nodeid=node][,initiator=node]\n"
>>       "-numa node[,memdev=id][,cpus=firstcpu[-lastcpu]][,nodeid=node][,initiator=node]\n"
>>       "-numa dist,src=source,dst=destination,val=distance\n"
>> -    "-numa cpu,node-id=node[,socket-id=x][,core-id=y][,thread-id=z]\n",
>> +    "-numa cpu,node-id=node[,socket-id=x][,core-id=y][,thread-id=z]\n"
>> +    "-numa hmat-lb,initiator=node,target=node,hierarchy=memory|first-level|second-level|third-level,data-type=access-latency|read-latency|write-latency[,latency=lat][,bandwidth=bw]\n",
> 
> Command-line parsing can then take human-written scaled numbers, and
> pre-convert them into the single scale accepted by the QMP interface.
>
diff mbox series

Patch

diff --git a/hw/acpi/hmat.h b/hw/acpi/hmat.h
index 6c32f12e78..b7c1e02cf0 100644
--- a/hw/acpi/hmat.h
+++ b/hw/acpi/hmat.h
@@ -42,6 +42,9 @@ 
 
 #define HMAT_IS_LATENCY(type) (type <= HMAT_LB_DATA_WRITE_LATENCY)
 
+#define PICO_PER_USEC 1000000
+#define PICO_PER_NSEC 1000
+
 struct HMAT_LB_Info {
     /*
      * Indicates total number of Proximity Domains
diff --git a/hw/core/numa.c b/hw/core/numa.c
index cfb6339810..9a494145f3 100644
--- a/hw/core/numa.c
+++ b/hw/core/numa.c
@@ -37,6 +37,7 @@ 
 #include "qemu/option.h"
 #include "qemu/config-file.h"
 #include "qemu/cutils.h"
+#include "hw/acpi/hmat.h"
 
 QemuOptsList qemu_numa_opts = {
     .name = "numa",
@@ -183,6 +184,184 @@  void parse_numa_distance(MachineState *ms, NumaDistOptions *dist, Error **errp)
     ms->numa_state->have_numa_distance = true;
 }
 
+void parse_numa_hmat_lb(MachineState *ms, NumaHmatLBOptions *node,
+                        Error **errp)
+{
+    int nb_numa_nodes = ms->numa_state->num_nodes;
+    NodeInfo *numa_info = ms->numa_state->nodes;
+    HMAT_LB_Info *hmat_lb = NULL;
+    const char *endptr;
+    int ret;
+    uint32_t latency, bandwidth;
+    uint64_t base_lat = 0, base_bw = 0;
+
+    if (node->data_type <= HMATLB_DATA_TYPE_WRITE_LATENCY) {
+        if (!node->has_latency) {
+            error_setg(errp, "Missing 'latency' option.");
+            return;
+        }
+        if (node->has_bandwidth) {
+            error_setg(errp, "Invalid option 'bandwidth' since "
+                       "the data type is latency.");
+            return;
+        }
+    }
+
+    if (node->data_type >= HMATLB_DATA_TYPE_ACCESS_BANDWIDTH) {
+        if (!node->has_bandwidth) {
+            error_setg(errp, "Missing 'bandwidth' option.");
+            return;
+        }
+        if (node->has_latency) {
+            error_setg(errp, "Invalid option 'latency' since "
+                       "the data type is bandwidth.");
+            return;
+        }
+    }
+
+    if (node->initiator >= nb_numa_nodes) {
+        error_setg(errp, "Invalid initiator=%"
+                   PRIu16 ", it should be less than %d.",
+                   node->initiator, nb_numa_nodes);
+        return;
+    }
+    if (!numa_info[node->initiator].has_cpu) {
+        error_setg(errp, "Invalid initiator=%"
+                   PRIu16 ", it isn't an initiator proximity domain.",
+                   node->initiator);
+        return;
+    }
+
+    if (node->target >= nb_numa_nodes) {
+        error_setg(errp, "Invalid target=%"
+                   PRIu16 ", it should be less than %d.",
+                   node->target, nb_numa_nodes);
+        return;
+    }
+    if (!numa_info[node->target].initiator_valid) {
+        error_setg(errp, "Invalid target=%"
+                   PRIu16 ", it hasn't a valid initiator proximity domain.",
+                   node->target);
+        return;
+    }
+
+    if (node->has_latency) {
+        hmat_lb = ms->numa_state->hmat_lb[node->hierarchy][node->data_type];
+
+        if (!hmat_lb) {
+            hmat_lb = g_malloc0(sizeof(*hmat_lb));
+            ms->numa_state->hmat_lb[node->hierarchy][node->data_type] = hmat_lb;
+        } else if (hmat_lb->latency[node->initiator][node->target]) {
+            error_setg(errp, "Duplicate configuration of the latency for "
+                       "initiator=%" PRIu16 " and target=%" PRIu16 ".",
+                       node->initiator, node->target);
+            return;
+        }
+
+        ret = qemu_strtoui(node->latency, &endptr, 10, &latency);
+        if (ret < 0) {
+            error_setg(errp, "Invalid latency %s", node->latency);
+            return;
+        }
+
+        if (*endptr == '\0') {
+            base_lat = 1;
+        } else if (*(endptr + 1) == 's') {
+            switch (*endptr) {
+            case 'p':
+                base_lat = 1;
+                break;
+            case 'n':
+                base_lat = PICO_PER_NSEC;
+                break;
+            case 'u':
+                base_lat = PICO_PER_USEC;
+                break;
+            }
+        } else {
+            error_setg(errp, "Invalid latency unit %s,"
+                "vaild units are \"ps\" \"ns\" \"us\"", node->latency);
+            return;
+        }
+
+        /* Only the first time of setting the base unit is valid. */
+        if (hmat_lb->base_lat == 0) {
+            hmat_lb->base_lat = base_lat;
+        } else if (hmat_lb->base_lat != base_lat) {
+            error_setg(errp, "Invalid latency unit %s,"
+                " please unify the units.", node->latency);
+            return;
+        }
+
+        if (latency >= UINT16_MAX) {
+            error_setg(errp, "Latency value %s overflow, max value"
+                " is %" PRIu16, node->latency, UINT16_MAX - 1);
+        }
+
+        hmat_lb->latency[node->initiator][node->target] = latency;
+    }
+
+    if (node->has_bandwidth) {
+        hmat_lb = ms->numa_state->hmat_lb[node->hierarchy][node->data_type];
+
+        if (!hmat_lb) {
+            hmat_lb = g_malloc0(sizeof(*hmat_lb));
+            ms->numa_state->hmat_lb[node->hierarchy][node->data_type] = hmat_lb;
+        } else if (hmat_lb->bandwidth[node->initiator][node->target]) {
+            error_setg(errp, "Duplicate configuration of the bandwidth for "
+                       "initiator=%" PRIu16 " and target=%" PRIu16 ".",
+                       node->initiator, node->target);
+            return;
+        }
+
+        ret = qemu_strtoui(node->bandwidth, &endptr, 10, &bandwidth);
+        if (ret < 0) {
+            error_setg(errp, "Invalid bandwidth %s", node->bandwidth);
+            return;
+        }
+
+        switch (toupper(*endptr)) {
+        case '\0':
+        case 'M':
+            base_bw = 1;
+            break;
+        case 'G':
+            base_bw = UINT64_C(1) << 10;
+            break;
+        case 'P':
+            base_bw = UINT64_C(1) << 20;
+            break;
+        }
+
+        if (base_bw == 0) {
+            error_setg(errp, "Invalid bandwidth unit %s,"
+                " vaild units are \"M\" \"G\" \"P\"", node->bandwidth);
+            return;
+        }
+
+        /* Only the first time of setting the base unit is valid. */
+        if (hmat_lb->base_bw == 0) {
+            hmat_lb->base_bw = base_bw;
+        } else if (hmat_lb->base_lat != base_lat) {
+            error_setg(errp, "Invalid bandwidth unit %s,"
+                " please unify the units.", node->bandwidth);
+            return;
+        }
+
+        if (bandwidth >= UINT16_MAX) {
+            error_setg(errp, "Bandwidth value %s overflow, max value"
+                " is %" PRIu16, node->bandwidth, UINT16_MAX - 1);
+        }
+
+        hmat_lb->bandwidth[node->initiator][node->target] = bandwidth;
+    }
+
+    if (hmat_lb) {
+        hmat_lb->hierarchy = node->hierarchy;
+        hmat_lb->data_type = node->data_type;
+    }
+}
+
 void set_numa_options(MachineState *ms, NumaOptions *object, Error **errp)
 {
     Error *err = NULL;
@@ -221,6 +400,12 @@  void set_numa_options(MachineState *ms, NumaOptions *object, Error **errp)
         machine_set_cpu_numa_node(ms, qapi_NumaCpuOptions_base(&object->u.cpu),
                                   &err);
         break;
+    case NUMA_OPTIONS_TYPE_HMAT_LB:
+        parse_numa_hmat_lb(ms, &object->u.hmat_lb, &err);
+        if (err) {
+            goto end;
+        }
+        break;
     default:
         abort();
     }
diff --git a/include/sysemu/numa.h b/include/sysemu/numa.h
index 1ed3362917..f0857b7ee6 100644
--- a/include/sysemu/numa.h
+++ b/include/sysemu/numa.h
@@ -41,6 +41,8 @@  typedef struct NumaState NumaState;
 
 void set_numa_options(MachineState *ms, NumaOptions *object, Error **errp);
 void parse_numa_opts(MachineState *ms);
+void parse_numa_hmat_lb(MachineState *ms, NumaHmatLBOptions *node,
+                        Error **errp);
 void numa_complete_configuration(MachineState *ms);
 void query_numa_node_mem(NumaNodeMem node_mem[], MachineState *ms);
 extern QemuOptsList qemu_numa_opts;
diff --git a/qapi/machine.json b/qapi/machine.json
index 05e367d26a..f19c761e52 100644
--- a/qapi/machine.json
+++ b/qapi/machine.json
@@ -377,10 +377,12 @@ 
 #
 # @cpu: property based CPU(s) to node mapping (Since: 2.10)
 #
+# @hmat-lb: memory latency and bandwidth information (Since: 4.2)
+#
 # Since: 2.1
 ##
 { 'enum': 'NumaOptionsType',
-  'data': [ 'node', 'dist', 'cpu' ] }
+  'data': [ 'node', 'dist', 'cpu', 'hmat-lb' ] }
 
 ##
 # @NumaOptions:
@@ -395,7 +397,8 @@ 
   'data': {
     'node': 'NumaNodeOptions',
     'dist': 'NumaDistOptions',
-    'cpu': 'NumaCpuOptions' }}
+    'cpu': 'NumaCpuOptions',
+    'hmat-lb': 'NumaHmatLBOptions' }}
 
 ##
 # @NumaNodeOptions:
@@ -504,6 +507,94 @@ 
    'base': 'CpuInstanceProperties',
    'data' : {} }
 
+##
+# @HmatLBMemoryHierarchy:
+#
+# The memory hierarchy in the System Locality Latency
+# and Bandwidth Information Structure of HMAT (Heterogeneous
+# Memory Attribute Table)
+#
+# For more information of @HmatLBMemoryHierarchy see
+# the chapter 5.2.27.4: Table 5-142: Field "Flags" of ACPI 6.3 spec.
+#
+# @memory: the structure represents the memory performance
+#
+# @first-level: first level memory of memory side cached memory
+#
+# @second-level: second level memory of memory side cached memory
+#
+# @third-level: third level memory of memory side cached memory
+#
+# Since: 4.2
+##
+{ 'enum': 'HmatLBMemoryHierarchy',
+  'data': [ 'memory', 'first-level', 'second-level', 'third-level' ] }
+
+##
+# @HmatLBDataType:
+#
+# Data type in the System Locality Latency
+# and Bandwidth Information Structure of HMAT (Heterogeneous
+# Memory Attribute Table)
+#
+# For more information of @HmatLBDataType see
+# the chapter 5.2.27.4: Table 5-142:  Field "Data Type" of ACPI 6.3 spec.
+#
+# @access-latency: access latency (picoseconds)
+#
+# @read-latency: read latency (picoseconds)
+#
+# @write-latency: write latency (picoseconds)
+#
+# @access-bandwidth: access bandwidth (MB/s)
+#
+# @read-bandwidth: read bandwidth (MB/s)
+#
+# @write-bandwidth: write bandwidth (MB/s)
+#
+# Since: 4.2
+##
+{ 'enum': 'HmatLBDataType',
+  'data': [ 'access-latency', 'read-latency', 'write-latency',
+            'access-bandwidth', 'read-bandwidth', 'write-bandwidth' ] }
+
+##
+# @NumaHmatLBOptions:
+#
+# Set the system locality latency and bandwidth information
+# between Initiator and Target proximity Domains.
+#
+# For more information of @NumaHmatLBOptions see
+# the chapter 5.2.27.4: Table 5-142 of ACPI 6.3 spec.
+#
+# @initiator: the Initiator Proximity Domain.
+#
+# @target: the Target Proximity Domain.
+#
+# @hierarchy: the Memory Hierarchy. Indicates the performance
+#             of memory or side cache.
+#
+# @data-type: presents the type of data, access/read/write
+#             latency or hit latency.
+#
+# @latency: the value of latency from @initiator to @target proximity domain,
+#           the latency units are "ps(picosecond)", "ns(nanosecond)" or
+#           "us(microsecond)".
+#
+# @bandwidth: the value of bandwidth between @initiator and @target proximity
+#             domain, the bandwidth units are "MB(/s)","GB(/s)" or "PB(/s)".
+#
+# Since: 4.2
+##
+{ 'struct': 'NumaHmatLBOptions',
+    'data': {
+    'initiator': 'uint16',
+    'target': 'uint16',
+    'hierarchy': 'HmatLBMemoryHierarchy',
+    'data-type': 'HmatLBDataType',
+    '*latency': 'str',
+    '*bandwidth': 'str' }}
+
 ##
 # @HostMemPolicy:
 #
diff --git a/qemu-options.hx b/qemu-options.hx
index c480781992..cda4607f3a 100644
--- a/qemu-options.hx
+++ b/qemu-options.hx
@@ -164,16 +164,19 @@  DEF("numa", HAS_ARG, QEMU_OPTION_numa,
     "-numa node[,mem=size][,cpus=firstcpu[-lastcpu]][,nodeid=node][,initiator=node]\n"
     "-numa node[,memdev=id][,cpus=firstcpu[-lastcpu]][,nodeid=node][,initiator=node]\n"
     "-numa dist,src=source,dst=destination,val=distance\n"
-    "-numa cpu,node-id=node[,socket-id=x][,core-id=y][,thread-id=z]\n",
+    "-numa cpu,node-id=node[,socket-id=x][,core-id=y][,thread-id=z]\n"
+    "-numa hmat-lb,initiator=node,target=node,hierarchy=memory|first-level|second-level|third-level,data-type=access-latency|read-latency|write-latency[,latency=lat][,bandwidth=bw]\n",
     QEMU_ARCH_ALL)
 STEXI
 @item -numa node[,mem=@var{size}][,cpus=@var{firstcpu}[-@var{lastcpu}]][,nodeid=@var{node}][,initiator=@var{initiator}]
 @itemx -numa node[,memdev=@var{id}][,cpus=@var{firstcpu}[-@var{lastcpu}]][,nodeid=@var{node}][,initiator=@var{initiator}]
 @itemx -numa dist,src=@var{source},dst=@var{destination},val=@var{distance}
 @itemx -numa cpu,node-id=@var{node}[,socket-id=@var{x}][,core-id=@var{y}][,thread-id=@var{z}]
+@itemx -numa hmat-lb,initiator=@var{node},target=@var{node},hierarchy=@var{str},data-type=@var{str}[,latency=@var{lat}][,bandwidth=@var{bw}]
 @findex -numa
 Define a NUMA node and assign RAM and VCPUs to it.
 Set the NUMA distance from a source node to a destination node.
+Set the ACPI Heterogeneous Memory Attributes for the given nodes.
 
 Legacy VCPU assignment uses @samp{cpus} option where
 @var{firstcpu} and @var{lastcpu} are CPU indexes. Each
@@ -250,6 +253,45 @@  specified resources, it just assigns existing resources to NUMA
 nodes. This means that one still has to use the @option{-m},
 @option{-smp} options to allocate RAM and VCPUs respectively.
 
+Use @samp{hmat-lb} to set System Locality Latency and Bandwidth Information
+between initiator and target NUMA nodes in ACPI Heterogeneous Attribute Memory Table (HMAT).
+Initiator NUMA node can create memory requests, usually including one or more processors.
+Target NUMA node contains addressable memory.
+
+In @samp{hmat-lb} option, @var{node} are NUMA node IDs. @var{str} of 'hierarchy'
+is the memory hierarchy of the target NUMA node: if @var{str} is 'memory', the structure
+represents the memory performance; if @var{str} is 'first-level|second-level|third-level',
+this structure represents aggregated performance of memory side caches for each domain.
+@var{str} of 'data-type' is type of data represented by this structure instance:
+if 'hierarchy' is 'memory', 'data-type' is 'access|read|write' latency(picoseconds)
+or 'access|read|write' bandwidth(MB/s) of the target memory; if 'hierarchy' is
+'first-level|second-level|third-level', 'data-type' is 'access|read|write' hit latency(picoseconds)
+or 'access|read|write' hit bandwidth of the target memory side cache.
+
+@var{lat} of 'latency' is latency value, the possible value and units are
+NUM[ps|ns|us] (picosecond|nanosecond|microsecond), default unit is 'ps'. @var{bw}
+is bandwidth value, the possible value and units are NUM[M|G|P], mean that
+the bandwidth value are NUM MB/s, GB/s or PB/s. Note that max NUM is 65534,
+if NUM is 0, means the corresponding latency or bandwidth information is not provided.
+
+For example, the following option assigns NUMA node 0 and 1. Node 0 has 2 cpus and
+a ram, node 1 has only a ram. The processors in node 0 access memory in node
+0 with access-latency 5 nanoseconds, access-bandwidth is 200 MB/s;
+The processors in NUMA node 0 access memory in NUMA node 1 with access-latency 10
+nanoseconds, access-bandwidth is 100 MB/s.
+@example
+-m 2G \
+-object memory-backend-ram,size=1024M,policy=bind,host-nodes=0,id=ram-node0 -numa node,nodeid=0,memdev=ram-node0 \
+-object memory-backend-ram,size=1024M,policy=bind,host-nodes=1,id=ram-node1 -numa node,nodeid=1,memdev=ram-node1 \
+-smp 2 \
+-numa cpu,node-id=0,socket-id=0 \
+-numa cpu,node-id=0,socket-id=1 \
+-numa hmat-lb,initiator=0,target=0,hierarchy=memory,data-type=access-latency,latency=5ns \
+-numa hmat-lb,initiator=0,target=0,hierarchy=memory,data-type=access-bandwidth,bandwidth=200M \
+-numa hmat-lb,initiator=0,target=1,hierarchy=memory,data-type=access-latency,latency=10ns \
+-numa hmat-lb,initiator=0,target=1,hierarchy=memory,data-type=access-bandwidth,bandwidth=100M \
+@end example
+
 ETEXI
 
 DEF("add-fd", HAS_ARG, QEMU_OPTION_add_fd,