Message ID | 20210628151117.545935-7-aneesh.kumar@linux.ibm.com (mailing list archive) |
---|---|
State | Superseded |
Series | Add support for FORM2 associativity |
Context | Check | Description |
---|---|---|
snowpatch_ozlabs/apply_patch | success | Successfully applied on branch powerpc/merge (0f7a719601eb957c10d417c62bd5f65080b5a409) |
snowpatch_ozlabs/build-ppc64le | warning | Build succeeded but added 1 new sparse warnings |
snowpatch_ozlabs/build-ppc64be | warning | Build succeeded but added 1 new sparse warnings |
snowpatch_ozlabs/build-ppc64e | success | Build succeeded |
snowpatch_ozlabs/build-pmac32 | success | Build succeeded |
snowpatch_ozlabs/checkpatch | warning | total: 0 errors, 3 warnings, 3 checks, 360 lines checked |
snowpatch_ozlabs/needsstable | success | Patch has no Fixes tags |
On Mon, Jun 28, 2021 at 08:41:17PM +0530, Aneesh Kumar K.V wrote:
> PAPR interface currently supports two different ways of communicating resource
> grouping details to the OS. These are referred to as Form 0 and Form 1
> associativity grouping. Form 0 is the older format and is now considered
> deprecated. This patch adds another resource grouping named FORM2.
>
> Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>
> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
> ---
>  Documentation/powerpc/associativity.rst   | 103 ++++++++++++++
>  arch/powerpc/include/asm/firmware.h       |   3 +-
>  arch/powerpc/include/asm/prom.h           |   1 +
>  arch/powerpc/kernel/prom_init.c           |   3 +-
>  arch/powerpc/mm/numa.c                    | 157 ++++++++++++++++++----
>  arch/powerpc/platforms/pseries/firmware.c |   1 +
>  6 files changed, 242 insertions(+), 26 deletions(-)
>  create mode 100644 Documentation/powerpc/associativity.rst
>
> diff --git a/Documentation/powerpc/associativity.rst b/Documentation/powerpc/associativity.rst
> new file mode 100644
> index 000000000000..31cc7da2c7a6
> --- /dev/null
> +++ b/Documentation/powerpc/associativity.rst
> @@ -0,0 +1,103 @@
> +============================
> +NUMA resource associativity
> +=============================
> +
> +Associativity represents the groupings of the various platform resources into
> +domains of substantially similar mean performance relative to resources outside
> +of that domain. Resources subsets of a given domain that exhibit better
> +performance relative to each other than relative to other resources subsets
> +are represented as being members of a sub-grouping domain. This performance
> +characteristic is presented in terms of NUMA node distance within the Linux kernel.
> +From the platform view, these groups are also referred to as domains.

Pretty hard to decipher, but that's typical for PAPR.

> +PAPR interface currently supports different ways of communicating these resource
> +grouping details to the OS. These are referred to as Form 0, Form 1 and Form2
> +associativity grouping. Form 0 is the older format and is now considered deprecated.

Nit: s/older/oldest/ since there are now >2 forms.

> +Hypervisor indicates the type/form of associativity used via "ibm,architecture-vec-5 property".
> +Bit 0 of byte 5 in the "ibm,architecture-vec-5" property indicates usage of Form 0 or Form 1.
> +A value of 1 indicates the usage of Form 1 associativity. For Form 2 associativity
> +bit 2 of byte 5 in the "ibm,architecture-vec-5" property is used.
> +
> +Form 0
> +-----
> +Form 0 associativity supports only two NUMA distances (LOCAL and REMOTE).
> +
> +Form 1
> +-----
> +With Form 1 a combination of ibm,associativity-reference-points, and ibm,associativity
> +device tree properties are used to determine the NUMA distance between resource groups/domains.
> +
> +The “ibm,associativity” property contains a list of one or more numbers (domainID)
> +representing the resource’s platform grouping domains.
> +
> +The “ibm,associativity-reference-points” property contains a list of one or more numbers
> +(domainID index) that represents the 1 based ordinal in the associativity lists.
> +The list of domainID indexes represents an increasing hierarchy of resource grouping.
> +
> +ex:
> +{ primary domainID index, secondary domainID index, tertiary domainID index.. }
> +
> +Linux kernel uses the domainID at the primary domainID index as the NUMA node id.
> +Linux kernel computes NUMA distance between two domains by recursively comparing
> +if they belong to the same higher-level domains. For mismatch at every higher
> +level of the resource group, the kernel doubles the NUMA distance between the
> +comparing domains.
> +
> +Form 2
> +-------
> +Form 2 associativity format adds separate device tree properties representing NUMA node distance
> +thereby making the node distance computation flexible. Form 2 also allows flexible primary
> +domain numbering. With numa distance computation now detached from the index value in
> +"ibm,associativity-reference-points" property, Form 2 allows a large number of primary domain
> +ids at the same domainID index representing resource groups of different performance/latency
> +characteristics.
> +
> +Hypervisor indicates the usage of FORM2 associativity using bit 2 of byte 5 in the
> +"ibm,architecture-vec-5" property.
> +
> +"ibm,numa-lookup-index-table" property contains a list of one or more numbers representing
> +the domainIDs present in the system. The offset of the domainID in this property is
> +used as an index while computing numa distance information via "ibm,numa-distance-table".
> +
> +prop-encoded-array: The number N of the domainIDs encoded as with encode-int, followed by
> +N domainID encoded as with encode-int
> +
> +For ex:
> +"ibm,numa-lookup-index-table" = {4, 0, 8, 250, 252}. The offset of domainID 8 (2) is used when
> +computing the distance of domain 8 from other domains present in the system. For the rest of
> +this document, this offset will be referred to as domain distance offset.
> +
> +"ibm,numa-distance-table" property contains a list of one or more numbers representing the NUMA
> +distance between resource groups/domains present in the system.
> +
> +prop-encoded-array: The number N of the distance values encoded as with encode-int, followed by
> +N distance values encoded as with encode-bytes. The max distance value we could encode is 255.
> +The number N must be equal to the square of m where m is the number of domainIDs in the
> +numa-lookup-index-table.
> +
> +For ex:
> +ibm,numa-lookup-index-table = {3, 0, 8, 40}
> +ibm,numa-distance-table = {9, 10, 20, 80, 20, 10, 160, 80, 160, 10}

This representation doesn't make it clear that the 9 is a u32, but the
rest are u8s.

> +
> +  | 0    8    40
> +--|------------
> +  |
> +0 | 10   20   80
> +  |
> +8 | 20   10   160
> +  |
> +40| 80   160  10
> +
> +A possible "ibm,associativity" property for resources in node 0, 8 and 40
> +
> +{ 3, 6, 7, 0 }
> +{ 3, 6, 9, 8 }
> +{ 3, 6, 7, 40}
> +
> +With "ibm,associativity-reference-points" { 0x3 }

You haven't actually described how ibm,associativity-reference-points
operates in Form2.

> +"ibm,lookup-index-table" helps in having a compact representation of distance matrix.
> +Since domainID can be sparse, the matrix of distances can also be effectively sparse.
> +With "ibm,lookup-index-table" we can achieve a compact representation of
> +distance information.
> diff --git a/arch/powerpc/include/asm/firmware.h b/arch/powerpc/include/asm/firmware.h
> index 60b631161360..97a3bd9ffeb9 100644
> --- a/arch/powerpc/include/asm/firmware.h
> +++ b/arch/powerpc/include/asm/firmware.h
> @@ -53,6 +53,7 @@
>  #define FW_FEATURE_ULTRAVISOR ASM_CONST(0x0000004000000000)
>  #define FW_FEATURE_STUFF_TCE ASM_CONST(0x0000008000000000)
>  #define FW_FEATURE_RPT_INVALIDATE ASM_CONST(0x0000010000000000)
> +#define FW_FEATURE_FORM2_AFFINITY ASM_CONST(0x0000020000000000)
>
>  #ifndef __ASSEMBLY__
>
> @@ -73,7 +74,7 @@ enum {
>  		FW_FEATURE_HPT_RESIZE | FW_FEATURE_DRMEM_V2 |
>  		FW_FEATURE_DRC_INFO | FW_FEATURE_BLOCK_REMOVE |
>  		FW_FEATURE_PAPR_SCM | FW_FEATURE_ULTRAVISOR |
> -		FW_FEATURE_RPT_INVALIDATE,
> +		FW_FEATURE_RPT_INVALIDATE | FW_FEATURE_FORM2_AFFINITY,
>  	FW_FEATURE_PSERIES_ALWAYS = 0,
>  	FW_FEATURE_POWERNV_POSSIBLE = FW_FEATURE_OPAL | FW_FEATURE_ULTRAVISOR,
>  	FW_FEATURE_POWERNV_ALWAYS = 0,
> diff --git a/arch/powerpc/include/asm/prom.h b/arch/powerpc/include/asm/prom.h
> index df9fec9d232c..5c80152e8f18 100644
> --- a/arch/powerpc/include/asm/prom.h
> +++ b/arch/powerpc/include/asm/prom.h
> @@ -149,6 +149,7 @@ extern int of_read_drc_info_cell(struct property **prop,
>  #define OV5_XCMO 0x0440 /* Page Coalescing */
>  #define OV5_FORM1_AFFINITY 0x0580 /* FORM1 NUMA affinity */
>  #define OV5_PRRN 0x0540 /* Platform Resource Reassignment */
> +#define OV5_FORM2_AFFINITY 0x0520 /* Form2 NUMA affinity */
>  #define OV5_HP_EVT 0x0604 /* Hot Plug Event support */
>  #define OV5_RESIZE_HPT 0x0601 /* Hash Page Table resizing */
>  #define OV5_PFO_HW_RNG 0x1180 /* PFO Random Number Generator */
> diff --git a/arch/powerpc/kernel/prom_init.c b/arch/powerpc/kernel/prom_init.c
> index 5d9ea059594f..c483df6c9393 100644
> --- a/arch/powerpc/kernel/prom_init.c
> +++ b/arch/powerpc/kernel/prom_init.c
> @@ -1069,7 +1069,8 @@ static const struct ibm_arch_vec ibm_architecture_vec_template __initconst = {
>  #else
>  		0,
>  #endif
> -		.associativity = OV5_FEAT(OV5_FORM1_AFFINITY) | OV5_FEAT(OV5_PRRN),
> +		.associativity = OV5_FEAT(OV5_FORM1_AFFINITY) | OV5_FEAT(OV5_PRRN) |
> +		OV5_FEAT(OV5_FORM2_AFFINITY),
>  		.bin_opts = OV5_FEAT(OV5_RESIZE_HPT) | OV5_FEAT(OV5_HP_EVT),
>  		.micro_checkpoint = 0,
>  		.reserved0 = 0,
> diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
> index c6293037a103..c68846fc9550 100644
> --- a/arch/powerpc/mm/numa.c
> +++ b/arch/powerpc/mm/numa.c
> @@ -56,12 +56,17 @@ static int n_mem_addr_cells, n_mem_size_cells;
>
>  #define FORM0_AFFINITY 0
>  #define FORM1_AFFINITY 1
> +#define FORM2_AFFINITY 2
>  static int affinity_form;
>
>  #define MAX_DISTANCE_REF_POINTS 4
>  static int max_associativity_domain_index;
>  static const __be32 *distance_ref_points;
>  static int distance_lookup_table[MAX_NUMNODES][MAX_DISTANCE_REF_POINTS];
> +static int numa_distance_table[MAX_NUMNODES][MAX_NUMNODES] = {
> +	[0 ... MAX_NUMNODES - 1] = { [0 ... MAX_NUMNODES - 1] = -1 }
> +};
> +static int numa_id_index_table[MAX_NUMNODES] = { [0 ... MAX_NUMNODES - 1] = NUMA_NO_NODE };
>
>  /*
>   * Allocate node_to_cpumask_map based on number of available nodes
> @@ -166,6 +171,44 @@ static void unmap_cpu_from_node(unsigned long cpu)
>  }
>  #endif /* CONFIG_HOTPLUG_CPU || CONFIG_PPC_SPLPAR */
>
> +/*
> + * Returns nid in the range [0..nr_node_ids], or -1 if no useful NUMA
> + * info is found.
> + */
> +static int associativity_to_nid(const __be32 *associativity)
> +{
> +	int nid = NUMA_NO_NODE;
> +
> +	if (!numa_enabled)
> +		goto out;
> +
> +	if (of_read_number(associativity, 1) >= primary_domain_index)
> +		nid = of_read_number(&associativity[primary_domain_index], 1);
> +
> +	/* POWER4 LPAR uses 0xffff as invalid node */
> +	if (nid == 0xffff || nid >= nr_node_ids)
> +		nid = NUMA_NO_NODE;
> +out:
> +	return nid;
> +}
> +
> +static int __cpu_form2_relative_distance(__be32 *cpu1_assoc, __be32 *cpu2_assoc)
> +{
> +	int dist;
> +	int node1, node2;
> +
> +	node1 = associativity_to_nid(cpu1_assoc);
> +	node2 = associativity_to_nid(cpu2_assoc);
> +
> +	dist = numa_distance_table[node1][node2];
> +	if (dist <= LOCAL_DISTANCE)
> +		return 0;
> +	else if (dist <= REMOTE_DISTANCE)
> +		return 1;
> +	else
> +		return 2;

Squashing the full range of distances into just 0, 1 or 2 seems odd.
But then, this whole cpu_distance() thing being distinct from
node_distance() seems odd.

> +}
> +
>  static int __cpu_form1_relative_distance(__be32 *cpu1_assoc, __be32 *cpu2_assoc)
>  {
>  	int dist = 0;
> @@ -186,8 +229,9 @@ int cpu_relative_distance(__be32 *cpu1_assoc, __be32 *cpu2_assoc)
>  {
>  	/* We should not get called with FORM0 */
>  	VM_WARN_ON(affinity_form == FORM0_AFFINITY);
> -
> -	return __cpu_form1_relative_distance(cpu1_assoc, cpu2_assoc);
> +	if (affinity_form == FORM1_AFFINITY)
> +		return __cpu_form1_relative_distance(cpu1_assoc, cpu2_assoc);
> +	return __cpu_form2_relative_distance(cpu1_assoc, cpu2_assoc);
>  }
>
>  /* must hold reference to node during call */
> @@ -201,7 +245,9 @@ int __node_distance(int a, int b)
>  	int i;
>  	int distance = LOCAL_DISTANCE;
>
> -	if (affinity_form == FORM0_AFFINITY)
> +	if (affinity_form == FORM2_AFFINITY)
> +		return numa_distance_table[a][b];
> +	else if (affinity_form == FORM0_AFFINITY)
>  		return ((a == b) ? LOCAL_DISTANCE : REMOTE_DISTANCE);
>
>  	for (i = 0; i < max_associativity_domain_index; i++) {

Hmm.. couldn't we simplify this whole __node_distance function, if we
just update numa_distance_table[][] appropriately for Form0 and Form1
as well?

> @@ -216,27 +262,6 @@ int __node_distance(int a, int b)
>  }
>  EXPORT_SYMBOL(__node_distance);
>
> -/*
> - * Returns nid in the range [0..nr_node_ids], or -1 if no useful NUMA
> - * info is found.
> - */
> -static int associativity_to_nid(const __be32 *associativity)
> -{
> -	int nid = NUMA_NO_NODE;
> -
> -	if (!numa_enabled)
> -		goto out;
> -
> -	if (of_read_number(associativity, 1) >= primary_domain_index)
> -		nid = of_read_number(&associativity[primary_domain_index], 1);
> -
> -	/* POWER4 LPAR uses 0xffff as invalid node */
> -	if (nid == 0xffff || nid >= nr_node_ids)
> -		nid = NUMA_NO_NODE;
> -out:
> -	return nid;
> -}
> -
>  /* Returns the nid associated with the given device tree node,
>   * or -1 if not found.
>   */
> @@ -305,12 +330,84 @@ static void initialize_form1_numa_distance(struct device_node *node)
>   */
>  void update_numa_distance(struct device_node *node)
>  {
> +	int nid;
> +
>  	if (affinity_form == FORM0_AFFINITY)
>  		return;
>  	else if (affinity_form == FORM1_AFFINITY) {
>  		initialize_form1_numa_distance(node);
>  		return;
>  	}
> +
> +	/* FORM2 affinity */
> +	nid = of_node_to_nid_single(node);
> +	if (nid == NUMA_NO_NODE)
> +		return;
> +
> +	/*
> +	 * With FORM2 we expect NUMA distance of all possible NUMA
> +	 * nodes to be provided during boot.
> +	 */
> +	WARN(numa_distance_table[nid][nid] == -1,
> +	     "NUMA distance details for node %d not provided\n", nid);
> +}
> +
> +/*
> + * ibm,numa-lookup-index-table= {N, domainid1, domainid2, ..... domainidN}
> + * ibm,numa-distance-table = { N, 1, 2, 4, 5, 1, 6, .... N elements}
> + */
> +static void initialize_form2_numa_distance_lookup_table(struct device_node *root)
> +{
> +	int i, j;
> +	const __u8 *numa_dist_table;
> +	const __be32 *numa_lookup_index;
> +	int numa_dist_table_length;
> +	int max_numa_index, distance_index;
> +
> +	numa_lookup_index = of_get_property(root, "ibm,numa-lookup-index-table", NULL);
> +	max_numa_index = of_read_number(&numa_lookup_index[0], 1);
> +
> +	/* first element of the array is the size and is encode-int */
> +	numa_dist_table = of_get_property(root, "ibm,numa-distance-table", NULL);
> +	numa_dist_table_length = of_read_number((const __be32 *)&numa_dist_table[0], 1);
> +	/* Skip the size which is encoded int */
> +	numa_dist_table += sizeof(__be32);
> +
> +	pr_debug("numa_dist_table_len = %d, numa_dist_indexes_len = %d\n",
> +		 numa_dist_table_length, max_numa_index);
> +
> +	for (i = 0; i < max_numa_index; i++)
> +		/* +1 skip the max_numa_index in the property */
> +		numa_id_index_table[i] = of_read_number(&numa_lookup_index[i + 1], 1);
> +
> +
> +	if (numa_dist_table_length != max_numa_index * max_numa_index) {
> +
> +		WARN(1, "Wrong NUMA distance information\n");
> +		/* consider everybody else just remote. */
> +		for (i = 0; i < max_numa_index; i++) {
> +			for (j = 0; j < max_numa_index; j++) {
> +				int nodeA = numa_id_index_table[i];
> +				int nodeB = numa_id_index_table[j];
> +
> +				if (nodeA == nodeB)
> +					numa_distance_table[nodeA][nodeB] = LOCAL_DISTANCE;
> +				else
> +					numa_distance_table[nodeA][nodeB] = REMOTE_DISTANCE;
> +			}
> +		}
> +	}
> +
> +	distance_index = 0;
> +	for (i = 0; i < max_numa_index; i++) {
> +		for (j = 0; j < max_numa_index; j++) {
> +			int nodeA = numa_id_index_table[i];
> +			int nodeB = numa_id_index_table[j];
> +
> +			numa_distance_table[nodeA][nodeB] = numa_dist_table[distance_index++];
> +			pr_debug("dist[%d][%d]=%d ", nodeA, nodeB, numa_distance_table[nodeA][nodeB]);
> +		}
> +	}
>  }
>
>  static int __init find_primary_domain_index(void)
> @@ -323,6 +420,9 @@ static int __init find_primary_domain_index(void)
>  	 */
>  	if (firmware_has_feature(FW_FEATURE_OPAL)) {
>  		affinity_form = FORM1_AFFINITY;
> +	} else if (firmware_has_feature(FW_FEATURE_FORM2_AFFINITY)) {
> +		dbg("Using form 2 affinity\n");
> +		affinity_form = FORM2_AFFINITY;
>  	} else if (firmware_has_feature(FW_FEATURE_FORM1_AFFINITY)) {
>  		dbg("Using form 1 affinity\n");
>  		affinity_form = FORM1_AFFINITY;
> @@ -367,8 +467,17 @@ static int __init find_primary_domain_index(void)
>
>  		index = of_read_number(&distance_ref_points[1], 1);
>  	} else {
> +		/*
> +		 * Both FORM1 and FORM2 affinity find the primary domain details
> +		 * at the same offset.
> +		 */
>  		index = of_read_number(distance_ref_points, 1);
>  	}
> +	/*
> +	 * If it is FORM2 also initialize the distance table here.
> +	 */
> +	if (affinity_form == FORM2_AFFINITY)
> +		initialize_form2_numa_distance_lookup_table(root);

Ew. Calling a function called "find_primary_domain_index" to also
initialize the main distance table is needlessly counterintuitive.
Move this call to parse_numa_properties().

>
>  	/*
>  	 * Warn and cap if the hardware supports more than
> diff --git a/arch/powerpc/platforms/pseries/firmware.c b/arch/powerpc/platforms/pseries/firmware.c
> index 5d4c2bc20bba..f162156b7b68 100644
> --- a/arch/powerpc/platforms/pseries/firmware.c
> +++ b/arch/powerpc/platforms/pseries/firmware.c
> @@ -123,6 +123,7 @@ vec5_fw_features_table[] = {
>  	{FW_FEATURE_PRRN, OV5_PRRN},
>  	{FW_FEATURE_DRMEM_V2, OV5_DRMEM_V2},
>  	{FW_FEATURE_DRC_INFO, OV5_DRC_INFO},
> +	{FW_FEATURE_FORM2_AFFINITY, OV5_FORM2_AFFINITY},
>  };
>
>  static void __init fw_vec5_feature_init(const char *vec5, unsigned long len)
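
Editorial illustration (not part of the thread): the review comment about the length cell being a u32 while the distances are u8s may be easier to follow with a small user-space sketch of the decoding described in the quoted documentation. Everything named here (read_be32, form2_decode, MAX_NODES) is hypothetical and chosen for this example; it is not the kernel code from the patch. Unlike the quoted hunk, which continues into the copy loop after the WARN branch, this sketch simply rejects inconsistent input.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_NODES 256	/* hypothetical cap, stands in for MAX_NUMNODES */

/* Read one big-endian 32-bit cell, i.e. what encode-int produces. */
static uint32_t read_be32(const uint8_t *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/*
 * Decode the raw bytes of "ibm,numa-lookup-index-table" (lookup) and
 * "ibm,numa-distance-table" (table) into dist[domainA][domainB].
 * Returns 0 on success, -1 if the two properties are inconsistent.
 */
static int form2_decode(const uint8_t *lookup, size_t lookup_len,
			const uint8_t *table, size_t table_len,
			int dist[MAX_NODES][MAX_NODES])
{
	uint32_t m, n, i, j;

	if (lookup_len < 4 || table_len < 4)
		return -1;

	m = read_be32(lookup);	/* number of domainIDs, a 32-bit cell */
	n = read_be32(table);	/* number of distances, also a 32-bit cell */

	if (m > MAX_NODES || lookup_len < 4 + 4 * (size_t)m)
		return -1;
	/* N must equal m squared, and each distance is a single byte. */
	if (n != m * m || table_len < 4 + (size_t)n)
		return -1;

	/* Reject domainIDs that would index outside dist[][]. */
	for (i = 0; i < m; i++)
		if (read_be32(lookup + 4 + 4 * i) >= MAX_NODES)
			return -1;

	for (i = 0; i < m; i++) {
		for (j = 0; j < m; j++) {
			uint32_t a = read_be32(lookup + 4 + 4 * i);
			uint32_t b = read_be32(lookup + 4 + 4 * j);

			/* Row-major matrix of u8 values after the length cell. */
			dist[a][b] = table[4 + i * m + j];
		}
	}
	return 0;
}
```

With the example from the quoted document, the lookup property would be 16 bytes (four 32-bit cells: 3, 0, 8, 40) and the distance property 13 bytes (one 32-bit cell holding 9, then nine single bytes), which is the distinction the {9, 10, 20, ...} notation hides.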
David Gibson <david@gibson.dropbear.id.au> writes:

> On Mon, Jun 28, 2021 at 08:41:17PM +0530, Aneesh Kumar K.V wrote:
>> PAPR interface currently supports two different ways of communicating resource
>> grouping details to the OS. These are referred to as Form 0 and Form 1
>> associativity grouping. Form 0 is the older format and is now considered
>> deprecated. This patch adds another resource grouping named FORM2.
>>
>> Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>
>> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
>> ---
>>  Documentation/powerpc/associativity.rst   | 103 ++++++++++++++
>>  arch/powerpc/include/asm/firmware.h       |   3 +-
>>  arch/powerpc/include/asm/prom.h           |   1 +
>>  arch/powerpc/kernel/prom_init.c           |   3 +-
>>  arch/powerpc/mm/numa.c                    | 157 ++++++++++++++++++----
>>  arch/powerpc/platforms/pseries/firmware.c |   1 +
>>  6 files changed, 242 insertions(+), 26 deletions(-)
>>  create mode 100644 Documentation/powerpc/associativity.rst
>>
>> diff --git a/Documentation/powerpc/associativity.rst b/Documentation/powerpc/associativity.rst
>> new file mode 100644
>> index 000000000000..31cc7da2c7a6
>> --- /dev/null
>> +++ b/Documentation/powerpc/associativity.rst
>> @@ -0,0 +1,103 @@
>> +============================
>> +NUMA resource associativity
>> +=============================
>> +
>> +Associativity represents the groupings of the various platform resources into
>> +domains of substantially similar mean performance relative to resources outside
>> +of that domain. Resources subsets of a given domain that exhibit better
>> +performance relative to each other than relative to other resources subsets
>> +are represented as being members of a sub-grouping domain. This performance
>> +characteristic is presented in terms of NUMA node distance within the Linux kernel.
>> +From the platform view, these groups are also referred to as domains.
>
> Pretty hard to decipher, but that's typical for PAPR.
>
>> +PAPR interface currently supports different ways of communicating these resource
>> +grouping details to the OS. These are referred to as Form 0, Form 1 and Form2
>> +associativity grouping. Form 0 is the older format and is now considered deprecated.
>
> Nit: s/older/oldest/ since there are now >2 forms.

updated.

>
>> +Hypervisor indicates the type/form of associativity used via "ibm,architecture-vec-5 property".
>> +Bit 0 of byte 5 in the "ibm,architecture-vec-5" property indicates usage of Form 0 or Form 1.
>> +A value of 1 indicates the usage of Form 1 associativity. For Form 2 associativity
>> +bit 2 of byte 5 in the "ibm,architecture-vec-5" property is used.
>> +
>> +Form 0
>> +-----
>> +Form 0 associativity supports only two NUMA distances (LOCAL and REMOTE).
>> +
>> +Form 1
>> +-----
>> +With Form 1 a combination of ibm,associativity-reference-points, and ibm,associativity
>> +device tree properties are used to determine the NUMA distance between resource groups/domains.
>> +
>> +The “ibm,associativity” property contains a list of one or more numbers (domainID)
>> +representing the resource’s platform grouping domains.
>> +
>> +The “ibm,associativity-reference-points” property contains a list of one or more numbers
>> +(domainID index) that represents the 1 based ordinal in the associativity lists.
>> +The list of domainID indexes represents an increasing hierarchy of resource grouping.
>> +
>> +ex:
>> +{ primary domainID index, secondary domainID index, tertiary domainID index.. }
>> +
>> +Linux kernel uses the domainID at the primary domainID index as the NUMA node id.
>> +Linux kernel computes NUMA distance between two domains by recursively comparing
>> +if they belong to the same higher-level domains. For mismatch at every higher
>> +level of the resource group, the kernel doubles the NUMA distance between the
>> +comparing domains.
>> +
>> +Form 2
>> +-------
>> +Form 2 associativity format adds separate device tree properties representing NUMA node distance
>> +thereby making the node distance computation flexible. Form 2 also allows flexible primary
>> +domain numbering. With numa distance computation now detached from the index value in
>> +"ibm,associativity-reference-points" property, Form 2 allows a large number of primary domain
>> +ids at the same domainID index representing resource groups of different performance/latency
>> +characteristics.
>> +
>> +Hypervisor indicates the usage of FORM2 associativity using bit 2 of byte 5 in the
>> +"ibm,architecture-vec-5" property.
>> +
>> +"ibm,numa-lookup-index-table" property contains a list of one or more numbers representing
>> +the domainIDs present in the system. The offset of the domainID in this property is
>> +used as an index while computing numa distance information via "ibm,numa-distance-table".
>> +
>> +prop-encoded-array: The number N of the domainIDs encoded as with encode-int, followed by
>> +N domainID encoded as with encode-int
>> +
>> +For ex:
>> +"ibm,numa-lookup-index-table" = {4, 0, 8, 250, 252}. The offset of domainID 8 (2) is used when
>> +computing the distance of domain 8 from other domains present in the system. For the rest of
>> +this document, this offset will be referred to as domain distance offset.
>> +
>> +"ibm,numa-distance-table" property contains a list of one or more numbers representing the NUMA
>> +distance between resource groups/domains present in the system.
>> +
>> +prop-encoded-array: The number N of the distance values encoded as with encode-int, followed by
>> +N distance values encoded as with encode-bytes. The max distance value we could encode is 255.
>> +The number N must be equal to the square of m where m is the number of domainIDs in the
>> +numa-lookup-index-table.
>> +
>> +For ex:
>> +ibm,numa-lookup-index-table = {3, 0, 8, 40}
>> +ibm,numa-distance-table = {9, 10, 20, 80, 20, 10, 160, 80, 160, 10}
>
> This representation doesn't make it clear that the 9 is a u32, but the
> rest are u8s.

How do you suggest we specify that? I could do 9:u32 10:u8 etc. But
considering the details are explained in the paragraph above, is
that needed?

>
>> +
>> +  | 0    8    40
>> +--|------------
>> +  |
>> +0 | 10   20   80
>> +  |
>> +8 | 20   10   160
>> +  |
>> +40| 80   160  10
>> +
>> +A possible "ibm,associativity" property for resources in node 0, 8 and 40
>> +
>> +{ 3, 6, 7, 0 }
>> +{ 3, 6, 9, 8 }
>> +{ 3, 6, 7, 40}
>> +
>> +With "ibm,associativity-reference-points" { 0x3 }
>
> You haven't actually described how ibm,associativity-reference-points
> operates in Form2.

Nothing change w.r.t the definition of associativity-reference-points
w.r.t FORM2. It still will continue to show the increasing hierarchy of
resource groups.

>
>> +"ibm,lookup-index-table" helps in having a compact representation of distance matrix.
>> +Since domainID can be sparse, the matrix of distances can also be effectively sparse.
>> +With "ibm,lookup-index-table" we can achieve a compact representation of
>> +distance information.
>> diff --git a/arch/powerpc/include/asm/firmware.h b/arch/powerpc/include/asm/firmware.h
>> index 60b631161360..97a3bd9ffeb9 100644
>> --- a/arch/powerpc/include/asm/firmware.h
>> +++ b/arch/powerpc/include/asm/firmware.h
>> @@ -53,6 +53,7 @@
>>  #define FW_FEATURE_ULTRAVISOR ASM_CONST(0x0000004000000000)
>>  #define FW_FEATURE_STUFF_TCE ASM_CONST(0x0000008000000000)
>>  #define FW_FEATURE_RPT_INVALIDATE ASM_CONST(0x0000010000000000)
>> +#define FW_FEATURE_FORM2_AFFINITY ASM_CONST(0x0000020000000000)
>>
>>  #ifndef __ASSEMBLY__
>>
>> @@ -73,7 +74,7 @@ enum {
>>  		FW_FEATURE_HPT_RESIZE | FW_FEATURE_DRMEM_V2 |
>>  		FW_FEATURE_DRC_INFO | FW_FEATURE_BLOCK_REMOVE |
>>  		FW_FEATURE_PAPR_SCM | FW_FEATURE_ULTRAVISOR |
>> -		FW_FEATURE_RPT_INVALIDATE,
>> +		FW_FEATURE_RPT_INVALIDATE | FW_FEATURE_FORM2_AFFINITY,
>>  	FW_FEATURE_PSERIES_ALWAYS = 0,
>>  	FW_FEATURE_POWERNV_POSSIBLE = FW_FEATURE_OPAL | FW_FEATURE_ULTRAVISOR,
>>  	FW_FEATURE_POWERNV_ALWAYS = 0,
>> diff --git a/arch/powerpc/include/asm/prom.h b/arch/powerpc/include/asm/prom.h
>> index df9fec9d232c..5c80152e8f18 100644
>> --- a/arch/powerpc/include/asm/prom.h
>> +++ b/arch/powerpc/include/asm/prom.h
>> @@ -149,6 +149,7 @@ extern int of_read_drc_info_cell(struct property **prop,
>>  #define OV5_XCMO 0x0440 /* Page Coalescing */
>>  #define OV5_FORM1_AFFINITY 0x0580 /* FORM1 NUMA affinity */
>>  #define OV5_PRRN 0x0540 /* Platform Resource Reassignment */
>> +#define OV5_FORM2_AFFINITY 0x0520 /* Form2 NUMA affinity */
>>  #define OV5_HP_EVT 0x0604 /* Hot Plug Event support */
>>  #define OV5_RESIZE_HPT 0x0601 /* Hash Page Table resizing */
>>  #define OV5_PFO_HW_RNG 0x1180 /* PFO Random Number Generator */
>> diff --git a/arch/powerpc/kernel/prom_init.c b/arch/powerpc/kernel/prom_init.c
>> index 5d9ea059594f..c483df6c9393 100644
>> --- a/arch/powerpc/kernel/prom_init.c
>> +++ b/arch/powerpc/kernel/prom_init.c
>> @@ -1069,7 +1069,8 @@ static const struct ibm_arch_vec ibm_architecture_vec_template __initconst = {
>>  #else
>>  		0,
>>  #endif
>> -		.associativity = OV5_FEAT(OV5_FORM1_AFFINITY) | OV5_FEAT(OV5_PRRN),
>> +		.associativity = OV5_FEAT(OV5_FORM1_AFFINITY) | OV5_FEAT(OV5_PRRN) |
>> +		OV5_FEAT(OV5_FORM2_AFFINITY),
>>  		.bin_opts = OV5_FEAT(OV5_RESIZE_HPT) | OV5_FEAT(OV5_HP_EVT),
>>  		.micro_checkpoint = 0,
>>  		.reserved0 = 0,
>> diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
>> index c6293037a103..c68846fc9550 100644
>> --- a/arch/powerpc/mm/numa.c
>> +++ b/arch/powerpc/mm/numa.c
>> @@ -56,12 +56,17 @@ static int n_mem_addr_cells, n_mem_size_cells;
>>
>>  #define FORM0_AFFINITY 0
>>  #define FORM1_AFFINITY 1
>> +#define FORM2_AFFINITY 2
>>  static int affinity_form;
>>
>>  #define MAX_DISTANCE_REF_POINTS 4
>>  static int max_associativity_domain_index;
>>  static const __be32 *distance_ref_points;
>>  static int distance_lookup_table[MAX_NUMNODES][MAX_DISTANCE_REF_POINTS];
>> +static int numa_distance_table[MAX_NUMNODES][MAX_NUMNODES] = {
>> +	[0 ... MAX_NUMNODES - 1] = { [0 ... MAX_NUMNODES - 1] = -1 }
>> +};
>> +static int numa_id_index_table[MAX_NUMNODES] = { [0 ... MAX_NUMNODES - 1] = NUMA_NO_NODE };
>>
>>  /*
>>   * Allocate node_to_cpumask_map based on number of available nodes
>> @@ -166,6 +171,44 @@ static void unmap_cpu_from_node(unsigned long cpu)
>>  }
>>  #endif /* CONFIG_HOTPLUG_CPU || CONFIG_PPC_SPLPAR */
>>
>> +/*
>> + * Returns nid in the range [0..nr_node_ids], or -1 if no useful NUMA
>> + * info is found.
>> + */
>> +static int associativity_to_nid(const __be32 *associativity)
>> +{
>> +	int nid = NUMA_NO_NODE;
>> +
>> +	if (!numa_enabled)
>> +		goto out;
>> +
>> +	if (of_read_number(associativity, 1) >= primary_domain_index)
>> +		nid = of_read_number(&associativity[primary_domain_index], 1);
>> +
>> +	/* POWER4 LPAR uses 0xffff as invalid node */
>> +	if (nid == 0xffff || nid >= nr_node_ids)
>> +		nid = NUMA_NO_NODE;
>> +out:
>> +	return nid;
>> +}
>> +
>> +static int __cpu_form2_relative_distance(__be32 *cpu1_assoc, __be32 *cpu2_assoc)
>> +{
>> +	int dist;
>> +	int node1, node2;
>> +
>> +	node1 = associativity_to_nid(cpu1_assoc);
>> +	node2 = associativity_to_nid(cpu2_assoc);
>> +
>> +	dist = numa_distance_table[node1][node2];
>> +	if (dist <= LOCAL_DISTANCE)
>> +		return 0;
>> +	else if (dist <= REMOTE_DISTANCE)
>> +		return 1;
>> +	else
>> +		return 2;
>
> Squashing the full range of distances into just 0, 1 or 2 seems odd.
> But then, this whole cpu_distance() thing being distinct from
> node_distance() seems odd.
>
>> +}
>> +
>>  static int __cpu_form1_relative_distance(__be32 *cpu1_assoc, __be32 *cpu2_assoc)
>>  {
>>  	int dist = 0;
>> @@ -186,8 +229,9 @@ int cpu_relative_distance(__be32 *cpu1_assoc, __be32 *cpu2_assoc)
>>  {
>>  	/* We should not get called with FORM0 */
>>  	VM_WARN_ON(affinity_form == FORM0_AFFINITY);
>> -
>> -	return __cpu_form1_relative_distance(cpu1_assoc, cpu2_assoc);
>> +	if (affinity_form == FORM1_AFFINITY)
>> +		return __cpu_form1_relative_distance(cpu1_assoc, cpu2_assoc);
>> +	return __cpu_form2_relative_distance(cpu1_assoc, cpu2_assoc);
>>  }
>>
>>  /* must hold reference to node during call */
>> @@ -201,7 +245,9 @@ int __node_distance(int a, int b)
>>  	int i;
>>  	int distance = LOCAL_DISTANCE;
>>
>> -	if (affinity_form == FORM0_AFFINITY)
>> +	if (affinity_form == FORM2_AFFINITY)
>> +		return numa_distance_table[a][b];
>> +	else if (affinity_form == FORM0_AFFINITY)
>>  		return ((a == b) ? LOCAL_DISTANCE : REMOTE_DISTANCE);
>>
>>  	for (i = 0; i < max_associativity_domain_index; i++) {
>
> Hmm.. couldn't we simplify this whole __node_distance function, if we
> just update numa_distance_table[][] appropriately for Form0 and Form1
> as well?

IIUC what you are suggesting is to look at the possibility of using
numa_distance_table[a][b] even for FORM1_AFFINITY? I can do that as part
of separate patch?

>
>> @@ -216,27 +262,6 @@ int __node_distance(int a, int b)
>>  }
>>  EXPORT_SYMBOL(__node_distance);
>>
>> -/*
>> - * Returns nid in the range [0..nr_node_ids], or -1 if no useful NUMA
>> - * info is found.
>> - */
>> -static int associativity_to_nid(const __be32 *associativity)
>> -{
>> -	int nid = NUMA_NO_NODE;
>> -
>> -	if (!numa_enabled)
>> -		goto out;
>> -
>> -	if (of_read_number(associativity, 1) >= primary_domain_index)
>> -		nid = of_read_number(&associativity[primary_domain_index], 1);
>> -
>> -	/* POWER4 LPAR uses 0xffff as invalid node */
>> -	if (nid == 0xffff || nid >= nr_node_ids)
>> -		nid = NUMA_NO_NODE;
>> -out:
>> -	return nid;
>> -}
>> -
>>  /* Returns the nid associated with the given device tree node,
>>   * or -1 if not found.
>>   */
>> @@ -305,12 +330,84 @@ static void initialize_form1_numa_distance(struct device_node *node)
>>   */
>>  void update_numa_distance(struct device_node *node)
>>  {
>> +	int nid;
>> +
>>  	if (affinity_form == FORM0_AFFINITY)
>>  		return;
>>  	else if (affinity_form == FORM1_AFFINITY) {
>>  		initialize_form1_numa_distance(node);
>>  		return;
>>  	}
>> +
>> +	/* FORM2 affinity */
>> +	nid = of_node_to_nid_single(node);
>> +	if (nid == NUMA_NO_NODE)
>> +		return;
>> +
>> +	/*
>> +	 * With FORM2 we expect NUMA distance of all possible NUMA
>> +	 * nodes to be provided during boot.
>> +	 */
>> +	WARN(numa_distance_table[nid][nid] == -1,
>> +	     "NUMA distance details for node %d not provided\n", nid);
>> +}
>> +
>> +/*
>> + * ibm,numa-lookup-index-table= {N, domainid1, domainid2, ..... domainidN}
>> + * ibm,numa-distance-table = { N, 1, 2, 4, 5, 1, 6, .... N elements}
>> + */
>> +static void initialize_form2_numa_distance_lookup_table(struct device_node *root)
>> +{
>> +	int i, j;
>> +	const __u8 *numa_dist_table;
>> +	const __be32 *numa_lookup_index;
>> +	int numa_dist_table_length;
>> +	int max_numa_index, distance_index;
>> +
>> +	numa_lookup_index = of_get_property(root, "ibm,numa-lookup-index-table", NULL);
>> +	max_numa_index = of_read_number(&numa_lookup_index[0], 1);
>> +
>> +	/* first element of the array is the size and is encode-int */
>> +	numa_dist_table = of_get_property(root, "ibm,numa-distance-table", NULL);
>> +	numa_dist_table_length = of_read_number((const __be32 *)&numa_dist_table[0], 1);
>> +	/* Skip the size which is encoded int */
>> +	numa_dist_table += sizeof(__be32);
>> +
>> +	pr_debug("numa_dist_table_len = %d, numa_dist_indexes_len = %d\n",
>> +		 numa_dist_table_length, max_numa_index);
>> +
>> +	for (i = 0; i < max_numa_index; i++)
>> +		/* +1 skip the max_numa_index in the property */
>> +		numa_id_index_table[i] = of_read_number(&numa_lookup_index[i + 1], 1);
>> +
>> +
>> +	if (numa_dist_table_length != max_numa_index * max_numa_index) {
>> +
>> +		WARN(1, "Wrong NUMA distance information\n");
>> +		/* consider everybody else just remote. */
>> +		for (i = 0; i < max_numa_index; i++) {
>> +			for (j = 0; j < max_numa_index; j++) {
>> +				int nodeA = numa_id_index_table[i];
>> +				int nodeB = numa_id_index_table[j];
>> +
>> +				if (nodeA == nodeB)
>> +					numa_distance_table[nodeA][nodeB] = LOCAL_DISTANCE;
>> +				else
>> +					numa_distance_table[nodeA][nodeB] = REMOTE_DISTANCE;
>> +			}
>> +		}
>> +	}
>> +
>> +	distance_index = 0;
>> +	for (i = 0; i < max_numa_index; i++) {
>> +		for (j = 0; j < max_numa_index; j++) {
>> +			int nodeA = numa_id_index_table[i];
>> +			int nodeB = numa_id_index_table[j];
>> +
>> +			numa_distance_table[nodeA][nodeB] = numa_dist_table[distance_index++];
>> +			pr_debug("dist[%d][%d]=%d ", nodeA, nodeB, numa_distance_table[nodeA][nodeB]);
>> +		}
>> +	}
>>  }
>>
>>  static int __init find_primary_domain_index(void)
>> @@ -323,6 +420,9 @@ static int __init find_primary_domain_index(void)
>>  	 */
>>  	if (firmware_has_feature(FW_FEATURE_OPAL)) {
>>  		affinity_form = FORM1_AFFINITY;
>> +	} else if (firmware_has_feature(FW_FEATURE_FORM2_AFFINITY)) {
>> +		dbg("Using form 2 affinity\n");
>> +		affinity_form = FORM2_AFFINITY;
>>  	} else if (firmware_has_feature(FW_FEATURE_FORM1_AFFINITY)) {
>>  		dbg("Using form 1 affinity\n");
>>  		affinity_form = FORM1_AFFINITY;
>> @@ -367,8 +467,17 @@ static int __init find_primary_domain_index(void)
>>
>>  		index = of_read_number(&distance_ref_points[1], 1);
>>  	} else {
>> +		/*
>> +		 * Both FORM1 and FORM2 affinity find the primary domain details
>> +		 * at the same offset.
>> +		 */
>>  		index = of_read_number(distance_ref_points, 1);
>>  	}
>> +	/*
>> +	 * If it is FORM2 also initialize the distance table here.
>> +	 */
>> +	if (affinity_form == FORM2_AFFINITY)
>> +		initialize_form2_numa_distance_lookup_table(root);
>
> Ew. Calling a function called "find_primary_domain_index" to also
> initialize the main distance table is needlessly counterintuitive.
> Move this call to parse_numa_properties().

The reason I ended up doing it here is because 'root' is already fetched
here.
But I agree it is confusing. I will move fetching of root inside
initialize_form2_numa_distance_lookup_table() and move the function
outside primary_index lookup.

modified   arch/powerpc/mm/numa.c
@@ -355,14 +355,22 @@ void update_numa_distance(struct device_node *node)
  * ibm,numa-lookup-index-table= {N, domainid1, domainid2, ..... domainidN}
  * ibm,numa-distance-table = { N, 1, 2, 4, 5, 1, 6, .... N elements}
  */
-static void initialize_form2_numa_distance_lookup_table(struct device_node *root)
+static void initialize_form2_numa_distance_lookup_table()
 {
 	int i, j;
+	struct device_node *root;
 	const __u8 *numa_dist_table;
 	const __be32 *numa_lookup_index;
 	int numa_dist_table_length;
 	int max_numa_index, distance_index;
 
+	if (firmware_has_feature(FW_FEATURE_OPAL))
+		root = of_find_node_by_path("/ibm,opal");
+	else
+		root = of_find_node_by_path("/rtas");
+	if (!root)
+		root = of_find_node_by_path("/");
+
 	numa_lookup_index = of_get_property(root, "ibm,numa-lookup-index-table", NULL);
 	max_numa_index = of_read_number(&numa_lookup_index[0], 1);
 
@@ -407,6 +415,7 @@ static void initialize_form2_numa_distance_lookup_table(struct device_node *root
 			pr_debug("dist[%d][%d]=%d ", nodeA, nodeB, numa_distance_table[nodeA][nodeB]);
 		}
 	}
+	of_node_put(root);
 }
 
 static int __init find_primary_domain_index(void)
@@ -472,12 +481,6 @@ static int __init find_primary_domain_index(void)
 		 */
 		index = of_read_number(distance_ref_points, 1);
 	}
-	/*
-	 * If it is FORM2 also initialize the distance table here.
-	 */
-	if (affinity_form == FORM2_AFFINITY)
-		initialize_form2_numa_distance_lookup_table(root);
-
 	/*
 	 * Warn and cap if the hardware supports more than
 	 * MAX_DISTANCE_REF_POINTS domains.
@@ -916,6 +919,12 @@ static int __init parse_numa_properties(void)
 
 	dbg("NUMA associativity depth for CPU/Memory: %d\n", primary_domain_index);
 
+	/*
+	 * If it is FORM2 also initialize the distance table here.
+	 */
+	if (affinity_form == FORM2_AFFINITY)
+		initialize_form2_numa_distance_lookup_table();
+
 	/*
 	 * Even though we connect cpus to numa domains later in SMP
 	 * init, we need to know the node ids now. This is because

-aneesh
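For readers following the exchange about __cpu_form2_relative_distance(), the bucketing that the review calls "squashing" can be sketched in isolation. This is only an illustration, not code from the series: distance_bucket() and relative_distance() are hypothetical names, the flat row-major table layout is an assumption, and returning 2 for an out-of-range index is a choice made for this sketch (the posted patch indexes numa_distance_table[][] directly). LOCAL_DISTANCE and REMOTE_DISTANCE use the default values from include/linux/topology.h.

```c
#include <assert.h>

#define LOCAL_DISTANCE	10	/* default values from include/linux/topology.h */
#define REMOTE_DISTANCE	20

/* Collapse a raw NUMA distance into the three buckets the patch returns. */
static int distance_bucket(int dist)
{
	if (dist <= LOCAL_DISTANCE)
		return 0;
	if (dist <= REMOTE_DISTANCE)
		return 1;
	return 2;
}

/*
 * Hypothetical helper: table is an nr x nr row-major matrix indexed by
 * table position. Assumption for this sketch only: an index outside the
 * table is treated as the farthest bucket instead of being dereferenced.
 */
static int relative_distance(const int *table, int nr, int idx1, int idx2)
{
	assert(table && nr > 0);

	if (idx1 < 0 || idx1 >= nr || idx2 < 0 || idx2 >= nr)
		return 2;

	return distance_bucket(table[idx1 * nr + idx2]);
}
```

With the 3x3 example matrix from the documentation (10/20/80, 20/10/160, 80/160/10), distances 80 and 160 both land in bucket 2, which is the loss of resolution the review comment points at.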
On Thu, Jul 22, 2021 at 01:04:42PM +0530, Aneesh Kumar K.V wrote: > David Gibson <david@gibson.dropbear.id.au> writes: > > > On Mon, Jun 28, 2021 at 08:41:17PM +0530, Aneesh Kumar K.V wrote: > >> PAPR interface currently supports two different ways of communicating resource > >> grouping details to the OS. These are referred to as Form 0 and Form 1 > >> associativity grouping. Form 0 is the older format and is now considered > >> deprecated. This patch adds another resource grouping named FORM2. > >> > >> Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com> > >> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> > >> --- > >> Documentation/powerpc/associativity.rst | 103 ++++++++++++++ > >> arch/powerpc/include/asm/firmware.h | 3 +- > >> arch/powerpc/include/asm/prom.h | 1 + > >> arch/powerpc/kernel/prom_init.c | 3 +- > >> arch/powerpc/mm/numa.c | 157 ++++++++++++++++++---- > >> arch/powerpc/platforms/pseries/firmware.c | 1 + > >> 6 files changed, 242 insertions(+), 26 deletions(-) > >> create mode 100644 Documentation/powerpc/associativity.rst > >> > >> diff --git a/Documentation/powerpc/associativity.rst b/Documentation/powerpc/associativity.rst > >> new file mode 100644 > >> index 000000000000..31cc7da2c7a6 > >> --- /dev/null > >> +++ b/Documentation/powerpc/associativity.rst > >> @@ -0,0 +1,103 @@ > >> +============================ > >> +NUMA resource associativity > >> +============================= > >> + > >> +Associativity represents the groupings of the various platform resources into > >> +domains of substantially similar mean performance relative to resources outside > >> +of that domain. Resources subsets of a given domain that exhibit better > >> +performance relative to each other than relative to other resources subsets > >> +are represented as being members of a sub-grouping domain. This performance > >> +characteristic is presented in terms of NUMA node distance within the Linux kernel. 
> >> +From the platform view, these groups are also referred to as domains. > > > > Pretty hard to decipher, but that's typical for PAPR. > > > >> +PAPR interface currently supports different ways of communicating these resource > >> +grouping details to the OS. These are referred to as Form 0, Form 1 and Form2 > >> +associativity grouping. Form 0 is the older format and is now considered deprecated. > > > > Nit: s/older/oldest/ since there are now >2 forms. > > updated. > > > > >> +Hypervisor indicates the type/form of associativity used via "ibm,architecture-vec-5 property". > >> +Bit 0 of byte 5 in the "ibm,architecture-vec-5" property indicates usage of Form 0 or Form 1. > >> +A value of 1 indicates the usage of Form 1 associativity. For Form 2 associativity > >> +bit 2 of byte 5 in the "ibm,architecture-vec-5" property is used. > >> + > >> +Form 0 > >> +----- > >> +Form 0 associativity supports only two NUMA distances (LOCAL and REMOTE). > >> + > >> +Form 1 > >> +----- > >> +With Form 1 a combination of ibm,associativity-reference-points, and ibm,associativity > >> +device tree properties are used to determine the NUMA distance between resource groups/domains. > >> + > >> +The “ibm,associativity” property contains a list of one or more numbers (domainID) > >> +representing the resource’s platform grouping domains. > >> + > >> +The “ibm,associativity-reference-points” property contains a list of one or more numbers > >> +(domainID index) that represents the 1 based ordinal in the associativity lists. > >> +The list of domainID indexes represents an increasing hierarchy of resource grouping. > >> + > >> +ex: > >> +{ primary domainID index, secondary domainID index, tertiary domainID index.. } > >> + > >> +Linux kernel uses the domainID at the primary domainID index as the NUMA node id. > >> +Linux kernel computes NUMA distance between two domains by recursively comparing > >> +if they belong to the same higher-level domains. 
For mismatch at every higher > >> +level of the resource group, the kernel doubles the NUMA distance between the > >> +comparing domains. > >> + > >> +Form 2 > >> +------- > >> +Form 2 associativity format adds separate device tree properties representing NUMA node distance > >> +thereby making the node distance computation flexible. Form 2 also allows flexible primary > >> +domain numbering. With numa distance computation now detached from the index value in > >> +"ibm,associativity-reference-points" property, Form 2 allows a large number of primary domain > >> +ids at the same domainID index representing resource groups of different performance/latency > >> +characteristics. > >> + > >> +Hypervisor indicates the usage of FORM2 associativity using bit 2 of byte 5 in the > >> +"ibm,architecture-vec-5" property. > >> + > >> +"ibm,numa-lookup-index-table" property contains a list of one or more numbers representing > >> +the domainIDs present in the system. The offset of the domainID in this property is > >> +used as an index while computing numa distance information via "ibm,numa-distance-table". > >> + > >> +prop-encoded-array: The number N of the domainIDs encoded as with encode-int, followed by > >> +N domainID encoded as with encode-int > >> + > >> +For ex: > >> +"ibm,numa-lookup-index-table" = {4, 0, 8, 250, 252}. The offset of domainID 8 (2) is used when > >> +computing the distance of domain 8 from other domains present in the system. For the rest of > >> +this document, this offset will be referred to as domain distance offset. > >> + > >> +"ibm,numa-distance-table" property contains a list of one or more numbers representing the NUMA > >> +distance between resource groups/domains present in the system. > >> + > >> +prop-encoded-array: The number N of the distance values encoded as with encode-int, followed by > >> +N distance values encoded as with encode-bytes. The max distance value we could encode is 255. 
> >> +The number N must be equal to the square of m where m is the number of domainIDs in the
> >> +numa-lookup-index-table.
> >> +
> >> +For ex:
> >> +ibm,numa-lookup-index-table = {3, 0, 8, 40}
> >> +ibm,numa-distance-table = {9, 10, 20, 80, 20, 10, 160, 80, 160, 10}
> >
> > This representation doesn't make it clear that the 9 is a u32, but the
> > rest are u8s.
>
> How do you suggest we specify that? I could do 9:u32 10:u8 etc. But
> considering the details are explained in the paragraph above, is that
> needed?

Yes, I think it is needed. The examples are, honestly, a lot easier
to read and follow than the PAPR-ese text, so people are much more
likely to be looking at those than parsing the minutiae of the text.

> >> +
> >> +  | 0    8   40
> >> +--|------------
> >> +  |
> >> +0 | 10   20  80
> >> +  |
> >> +8 | 20   10  160
> >> +  |
> >> +40| 80   160  10
> >> +
> >> +A possible "ibm,associativity" property for resources in node 0, 8 and 40
> >> +
> >> +{ 3, 6, 7, 0 }
> >> +{ 3, 6, 9, 8 }
> >> +{ 3, 6, 7, 40}
> >> +
> >> +With "ibm,associativity-reference-points" { 0x3 }
> >
> > You haven't actually described how ibm,associativity-reference-points
> > operates in Form2.
>
> Nothing change w.r.t the definition of associativity-reference-points
> w.r.t FORM2. It still will continue to show the increasing hierarchy of
> resource groups.

I guess, except that really none of them matter except the primary any
more.

>
> >
> >> +"ibm,lookup-index-table" helps in having a compact representation of distance matrix.
> >> +Since domainID can be sparse, the matrix of distances can also be effectively sparse.
> >> +With "ibm,lookup-index-table" we can achieve a compact representation of
> >> +distance information.
> >> diff --git a/arch/powerpc/include/asm/firmware.h b/arch/powerpc/include/asm/firmware.h > >> index 60b631161360..97a3bd9ffeb9 100644 > >> --- a/arch/powerpc/include/asm/firmware.h > >> +++ b/arch/powerpc/include/asm/firmware.h > >> @@ -53,6 +53,7 @@ > >> #define FW_FEATURE_ULTRAVISOR ASM_CONST(0x0000004000000000) > >> #define FW_FEATURE_STUFF_TCE ASM_CONST(0x0000008000000000) > >> #define FW_FEATURE_RPT_INVALIDATE ASM_CONST(0x0000010000000000) > >> +#define FW_FEATURE_FORM2_AFFINITY ASM_CONST(0x0000020000000000) > >> > >> #ifndef __ASSEMBLY__ > >> > >> @@ -73,7 +74,7 @@ enum { > >> FW_FEATURE_HPT_RESIZE | FW_FEATURE_DRMEM_V2 | > >> FW_FEATURE_DRC_INFO | FW_FEATURE_BLOCK_REMOVE | > >> FW_FEATURE_PAPR_SCM | FW_FEATURE_ULTRAVISOR | > >> - FW_FEATURE_RPT_INVALIDATE, > >> + FW_FEATURE_RPT_INVALIDATE | FW_FEATURE_FORM2_AFFINITY, > >> FW_FEATURE_PSERIES_ALWAYS = 0, > >> FW_FEATURE_POWERNV_POSSIBLE = FW_FEATURE_OPAL | FW_FEATURE_ULTRAVISOR, > >> FW_FEATURE_POWERNV_ALWAYS = 0, > >> diff --git a/arch/powerpc/include/asm/prom.h b/arch/powerpc/include/asm/prom.h > >> index df9fec9d232c..5c80152e8f18 100644 > >> --- a/arch/powerpc/include/asm/prom.h > >> +++ b/arch/powerpc/include/asm/prom.h > >> @@ -149,6 +149,7 @@ extern int of_read_drc_info_cell(struct property **prop, > >> #define OV5_XCMO 0x0440 /* Page Coalescing */ > >> #define OV5_FORM1_AFFINITY 0x0580 /* FORM1 NUMA affinity */ > >> #define OV5_PRRN 0x0540 /* Platform Resource Reassignment */ > >> +#define OV5_FORM2_AFFINITY 0x0520 /* Form2 NUMA affinity */ > >> #define OV5_HP_EVT 0x0604 /* Hot Plug Event support */ > >> #define OV5_RESIZE_HPT 0x0601 /* Hash Page Table resizing */ > >> #define OV5_PFO_HW_RNG 0x1180 /* PFO Random Number Generator */ > >> diff --git a/arch/powerpc/kernel/prom_init.c b/arch/powerpc/kernel/prom_init.c > >> index 5d9ea059594f..c483df6c9393 100644 > >> --- a/arch/powerpc/kernel/prom_init.c > >> +++ b/arch/powerpc/kernel/prom_init.c > >> @@ -1069,7 +1069,8 @@ static const struct 
ibm_arch_vec ibm_architecture_vec_template __initconst = { > >> #else > >> 0, > >> #endif > >> - .associativity = OV5_FEAT(OV5_FORM1_AFFINITY) | OV5_FEAT(OV5_PRRN), > >> + .associativity = OV5_FEAT(OV5_FORM1_AFFINITY) | OV5_FEAT(OV5_PRRN) | > >> + OV5_FEAT(OV5_FORM2_AFFINITY), > >> .bin_opts = OV5_FEAT(OV5_RESIZE_HPT) | OV5_FEAT(OV5_HP_EVT), > >> .micro_checkpoint = 0, > >> .reserved0 = 0, > >> diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c > >> index c6293037a103..c68846fc9550 100644 > >> --- a/arch/powerpc/mm/numa.c > >> +++ b/arch/powerpc/mm/numa.c > >> @@ -56,12 +56,17 @@ static int n_mem_addr_cells, n_mem_size_cells; > >> > >> #define FORM0_AFFINITY 0 > >> #define FORM1_AFFINITY 1 > >> +#define FORM2_AFFINITY 2 > >> static int affinity_form; > >> > >> #define MAX_DISTANCE_REF_POINTS 4 > >> static int max_associativity_domain_index; > >> static const __be32 *distance_ref_points; > >> static int distance_lookup_table[MAX_NUMNODES][MAX_DISTANCE_REF_POINTS]; > >> +static int numa_distance_table[MAX_NUMNODES][MAX_NUMNODES] = { > >> + [0 ... MAX_NUMNODES - 1] = { [0 ... MAX_NUMNODES - 1] = -1 } > >> +}; > >> +static int numa_id_index_table[MAX_NUMNODES] = { [0 ... MAX_NUMNODES - 1] = NUMA_NO_NODE }; > >> > >> /* > >> * Allocate node_to_cpumask_map based on number of available nodes > >> @@ -166,6 +171,44 @@ static void unmap_cpu_from_node(unsigned long cpu) > >> } > >> #endif /* CONFIG_HOTPLUG_CPU || CONFIG_PPC_SPLPAR */ > >> > >> +/* > >> + * Returns nid in the range [0..nr_node_ids], or -1 if no useful NUMA > >> + * info is found. 
> >> + */ > >> +static int associativity_to_nid(const __be32 *associativity) > >> +{ > >> + int nid = NUMA_NO_NODE; > >> + > >> + if (!numa_enabled) > >> + goto out; > >> + > >> + if (of_read_number(associativity, 1) >= primary_domain_index) > >> + nid = of_read_number(&associativity[primary_domain_index], 1); > >> + > >> + /* POWER4 LPAR uses 0xffff as invalid node */ > >> + if (nid == 0xffff || nid >= nr_node_ids) > >> + nid = NUMA_NO_NODE; > >> +out: > >> + return nid; > >> +} > >> + > >> +static int __cpu_form2_relative_distance(__be32 *cpu1_assoc, __be32 *cpu2_assoc) > >> +{ > >> + int dist; > >> + int node1, node2; > >> + > >> + node1 = associativity_to_nid(cpu1_assoc); > >> + node2 = associativity_to_nid(cpu2_assoc); > >> + > >> + dist = numa_distance_table[node1][node2]; > >> + if (dist <= LOCAL_DISTANCE) > >> + return 0; > >> + else if (dist <= REMOTE_DISTANCE) > >> + return 1; > >> + else > >> + return 2; > > > > Squashing the full range of distances into just 0, 1 or 2 seems odd. > > But then, this whole cpu_distance() thing being distinct from > > node_distance() seems odd. 
> > > >> +} > >> + > >> static int __cpu_form1_relative_distance(__be32 *cpu1_assoc, __be32 *cpu2_assoc) > >> { > >> int dist = 0; > >> @@ -186,8 +229,9 @@ int cpu_relative_distance(__be32 *cpu1_assoc, __be32 *cpu2_assoc) > >> { > >> /* We should not get called with FORM0 */ > >> VM_WARN_ON(affinity_form == FORM0_AFFINITY); > >> - > >> - return __cpu_form1_relative_distance(cpu1_assoc, cpu2_assoc); > >> + if (affinity_form == FORM1_AFFINITY) > >> + return __cpu_form1_relative_distance(cpu1_assoc, cpu2_assoc); > >> + return __cpu_form2_relative_distance(cpu1_assoc, cpu2_assoc); > >> } > >> > >> /* must hold reference to node during call */ > >> @@ -201,7 +245,9 @@ int __node_distance(int a, int b) > >> int i; > >> int distance = LOCAL_DISTANCE; > >> > >> - if (affinity_form == FORM0_AFFINITY) > >> + if (affinity_form == FORM2_AFFINITY) > >> + return numa_distance_table[a][b]; > >> + else if (affinity_form == FORM0_AFFINITY) > >> return ((a == b) ? LOCAL_DISTANCE : REMOTE_DISTANCE); > >> > >> for (i = 0; i < max_associativity_domain_index; i++) { > > > > Hmm.. couldn't we simplify this whole __node_distance function, if we > > just update numa_distance_table[][] appropriately for Form0 and Form1 > > as well? > > IIUC what you are suggesting is to look at the possibility of using > numa_distance_table[a][b] even for FORM1_AFFINITY? I can do that as part > of separate patch? Ok, that's reasonable. > > > >> @@ -216,27 +262,6 @@ int __node_distance(int a, int b) > >> } > >> EXPORT_SYMBOL(__node_distance); > >> > >> -/* > >> - * Returns nid in the range [0..nr_node_ids], or -1 if no useful NUMA > >> - * info is found. 
> >> - */ > >> -static int associativity_to_nid(const __be32 *associativity) > >> -{ > >> - int nid = NUMA_NO_NODE; > >> - > >> - if (!numa_enabled) > >> - goto out; > >> - > >> - if (of_read_number(associativity, 1) >= primary_domain_index) > >> - nid = of_read_number(&associativity[primary_domain_index], 1); > >> - > >> - /* POWER4 LPAR uses 0xffff as invalid node */ > >> - if (nid == 0xffff || nid >= nr_node_ids) > >> - nid = NUMA_NO_NODE; > >> -out: > >> - return nid; > >> -} > >> - > >> /* Returns the nid associated with the given device tree node, > >> * or -1 if not found. > >> */ > >> @@ -305,12 +330,84 @@ static void initialize_form1_numa_distance(struct device_node *node) > >> */ > >> void update_numa_distance(struct device_node *node) > >> { > >> + int nid; > >> + > >> if (affinity_form == FORM0_AFFINITY) > >> return; > >> else if (affinity_form == FORM1_AFFINITY) { > >> initialize_form1_numa_distance(node); > >> return; > >> } > >> + > >> + /* FORM2 affinity */ > >> + nid = of_node_to_nid_single(node); > >> + if (nid == NUMA_NO_NODE) > >> + return; > >> + > >> + /* > >> + * With FORM2 we expect NUMA distance of all possible NUMA > >> + * nodes to be provided during boot. > >> + */ > >> + WARN(numa_distance_table[nid][nid] == -1, > >> + "NUMA distance details for node %d not provided\n", nid); > >> +} > >> + > >> +/* > >> + * ibm,numa-lookup-index-table= {N, domainid1, domainid2, ..... domainidN} > >> + * ibm,numa-distance-table = { N, 1, 2, 4, 5, 1, 6, .... 
N elements} > >> + */ > >> +static void initialize_form2_numa_distance_lookup_table(struct device_node *root) > >> +{ > >> + int i, j; > >> + const __u8 *numa_dist_table; > >> + const __be32 *numa_lookup_index; > >> + int numa_dist_table_length; > >> + int max_numa_index, distance_index; > >> + > >> + numa_lookup_index = of_get_property(root, "ibm,numa-lookup-index-table", NULL); > >> + max_numa_index = of_read_number(&numa_lookup_index[0], 1); > >> + > >> + /* first element of the array is the size and is encode-int */ > >> + numa_dist_table = of_get_property(root, "ibm,numa-distance-table", NULL); > >> + numa_dist_table_length = of_read_number((const __be32 *)&numa_dist_table[0], 1); > >> + /* Skip the size which is encoded int */ > >> + numa_dist_table += sizeof(__be32); > >> + > >> + pr_debug("numa_dist_table_len = %d, numa_dist_indexes_len = %d\n", > >> + numa_dist_table_length, max_numa_index); > >> + > >> + for (i = 0; i < max_numa_index; i++) > >> + /* +1 skip the max_numa_index in the property */ > >> + numa_id_index_table[i] = of_read_number(&numa_lookup_index[i + 1], 1); > >> + > >> + > >> + if (numa_dist_table_length != max_numa_index * max_numa_index) { > >> + > >> + WARN(1, "Wrong NUMA distance information\n"); > >> + /* consider everybody else just remote. 
*/ > >> + for (i = 0; i < max_numa_index; i++) { > >> + for (j = 0; j < max_numa_index; j++) { > >> + int nodeA = numa_id_index_table[i]; > >> + int nodeB = numa_id_index_table[j]; > >> + > >> + if (nodeA == nodeB) > >> + numa_distance_table[nodeA][nodeB] = LOCAL_DISTANCE; > >> + else > >> + numa_distance_table[nodeA][nodeB] = REMOTE_DISTANCE; > >> + } > >> + } > >> + } > >> + > >> + distance_index = 0; > >> + for (i = 0; i < max_numa_index; i++) { > >> + for (j = 0; j < max_numa_index; j++) { > >> + int nodeA = numa_id_index_table[i]; > >> + int nodeB = numa_id_index_table[j]; > >> + > >> + numa_distance_table[nodeA][nodeB] = numa_dist_table[distance_index++]; > >> + pr_debug("dist[%d][%d]=%d ", nodeA, nodeB, numa_distance_table[nodeA][nodeB]); > >> + } > >> + } > >> } > >> > >> static int __init find_primary_domain_index(void) > >> @@ -323,6 +420,9 @@ static int __init find_primary_domain_index(void) > >> */ > >> if (firmware_has_feature(FW_FEATURE_OPAL)) { > >> affinity_form = FORM1_AFFINITY; > >> + } else if (firmware_has_feature(FW_FEATURE_FORM2_AFFINITY)) { > >> + dbg("Using form 2 affinity\n"); > >> + affinity_form = FORM2_AFFINITY; > >> } else if (firmware_has_feature(FW_FEATURE_FORM1_AFFINITY)) { > >> dbg("Using form 1 affinity\n"); > >> affinity_form = FORM1_AFFINITY; > >> @@ -367,8 +467,17 @@ static int __init find_primary_domain_index(void) > >> > >> index = of_read_number(&distance_ref_points[1], 1); > >> } else { > >> + /* > >> + * Both FORM1 and FORM2 affinity find the primary domain details > >> + * at the same offset. > >> + */ > >> index = of_read_number(distance_ref_points, 1); > >> } > >> + /* > >> + * If it is FORM2 also initialize the distance table here. > >> + */ > >> + if (affinity_form == FORM2_AFFINITY) > >> + initialize_form2_numa_distance_lookup_table(root); > > > > Ew. Calling a function called "find_primary_domain_index" to also > > initialize the main distance table is needlessly counterintuitive. 
> > Move this call to parse_numa_properties().
>
> The reason I ended up doing it here is because 'root' is already fetched
> here. But I agree it is confusing. I will move fetching of root inside
> initialize_form2_numa_distance_lookup_table() and move the function
> outside primary_index lookup.

Ok. This is not a hot path anyway, so looking up root twice isn't
really a big deal anyway.

>
> modified   arch/powerpc/mm/numa.c
> @@ -355,14 +355,22 @@ void update_numa_distance(struct device_node *node)
>   * ibm,numa-lookup-index-table= {N, domainid1, domainid2, ..... domainidN}
>   * ibm,numa-distance-table = { N, 1, 2, 4, 5, 1, 6, .... N elements}
>   */
> -static void initialize_form2_numa_distance_lookup_table(struct device_node *root)
> +static void initialize_form2_numa_distance_lookup_table()
>  {
>  	int i, j;
> +	struct device_node *root;
>  	const __u8 *numa_dist_table;
>  	const __be32 *numa_lookup_index;
>  	int numa_dist_table_length;
>  	int max_numa_index, distance_index;
>
> +	if (firmware_has_feature(FW_FEATURE_OPAL))
> +		root = of_find_node_by_path("/ibm,opal");
> +	else
> +		root = of_find_node_by_path("/rtas");
> +	if (!root)
> +		root = of_find_node_by_path("/");
> +
>  	numa_lookup_index = of_get_property(root, "ibm,numa-lookup-index-table", NULL);
>  	max_numa_index = of_read_number(&numa_lookup_index[0], 1);
>
> @@ -407,6 +415,7 @@ static void initialize_form2_numa_distance_lookup_table(struct device_node *root
>  			pr_debug("dist[%d][%d]=%d ", nodeA, nodeB, numa_distance_table[nodeA][nodeB]);
>  		}
>  	}
> +	of_node_put(root);
>  }
>
>  static int __init find_primary_domain_index(void)
> @@ -472,12 +481,6 @@ static int __init find_primary_domain_index(void)
>  		 */
>  		index = of_read_number(distance_ref_points, 1);
>  	}
> -	/*
> -	 * If it is FORM2 also initialize the distance table here.
> -	 */
> -	if (affinity_form == FORM2_AFFINITY)
> -		initialize_form2_numa_distance_lookup_table(root);
> -
>  	/*
>  	 * Warn and cap if the hardware supports more than
>  	 * MAX_DISTANCE_REF_POINTS domains.
> @@ -916,6 +919,12 @@ static int __init parse_numa_properties(void)
>
>  	dbg("NUMA associativity depth for CPU/Memory: %d\n", primary_domain_index);
>
> +	/*
> +	 * If it is FORM2 also initialize the distance table here.
> +	 */
> +	if (affinity_form == FORM2_AFFINITY)
> +		initialize_form2_numa_distance_lookup_table();
> +
>  	/*
>  	 * Even though we connect cpus to numa domains later in SMP
>  	 * init, we need to know the node ids now. This is because
>
> -aneesh
>
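On the u32-versus-u8 point raised in this review, the byte layout of the two Form 2 properties may be easier to see in a standalone sketch. The code below is an illustration under stated assumptions, not the kernel implementation: form2_parse(), read_be32(), the distance[][] array and the MAX_NODES bound are all hypothetical names introduced here, and the raw property blobs are passed in as plain byte buffers. It assumes what the quoted documentation states: each table starts with a count encoded as a big-endian 32-bit cell, the lookup table continues with 32-bit domainIDs, and the distance table continues with single-byte distances in row-major order.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_NODES 64	/* hypothetical bound standing in for MAX_NUMNODES */

/* distance[a][b] for domainIDs a and b; entries never filled stay 0 here. */
static int distance[MAX_NODES][MAX_NODES];

/* Read one big-endian 32-bit cell (an encode-int value). */
static uint32_t read_be32(const uint8_t *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/*
 * Fill distance[][] from the two raw property blobs.
 * Returns 0 on success, -1 if the blobs are short or inconsistent.
 */
static int form2_parse(const uint8_t *lookup, size_t lookup_len,
		       const uint8_t *dist, size_t dist_len)
{
	uint32_t m, n, i, j;

	assert(lookup && dist);

	if (lookup_len < 4 || dist_len < 4)
		return -1;

	m = read_be32(lookup);	/* number of domainIDs, a u32 */
	n = read_be32(dist);	/* number of distance bytes, a u32 */
	if (m == 0 || m > MAX_NODES || n != m * m)
		return -1;
	if (lookup_len < 4 + 4 * (size_t)m || dist_len < 4 + (size_t)n)
		return -1;

	/* Reject domainIDs that do not fit the table before writing anything. */
	for (i = 0; i < m; i++)
		if (read_be32(lookup + 4 + 4 * i) >= MAX_NODES)
			return -1;

	for (i = 0; i < m; i++) {
		uint32_t a = read_be32(lookup + 4 + 4 * i);

		for (j = 0; j < m; j++) {
			uint32_t b = read_be32(lookup + 4 + 4 * j);

			/* Distances are single bytes (u8), row-major. */
			distance[a][b] = dist[4 + i * m + j];
		}
	}
	return 0;
}
```

For the example in the documentation, the lookup blob would be the bytes 00 00 00 03 | 00 00 00 00 | 00 00 00 08 | 00 00 00 28 and the distance blob 00 00 00 09 | 0a 14 50 | 14 0a a0 | 50 a0 0a. The 9 occupies four bytes while every distance occupies one. Unlike the posted hunk, this sketch returns early on a size mismatch instead of continuing to read the distance bytes; that is a design choice of the sketch, not a statement about what the series does.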
diff --git a/Documentation/powerpc/associativity.rst b/Documentation/powerpc/associativity.rst new file mode 100644 index 000000000000..31cc7da2c7a6 --- /dev/null +++ b/Documentation/powerpc/associativity.rst @@ -0,0 +1,103 @@ +============================ +NUMA resource associativity +============================= + +Associativity represents the groupings of the various platform resources into +domains of substantially similar mean performance relative to resources outside +of that domain. Resources subsets of a given domain that exhibit better +performance relative to each other than relative to other resources subsets +are represented as being members of a sub-grouping domain. This performance +characteristic is presented in terms of NUMA node distance within the Linux kernel. +From the platform view, these groups are also referred to as domains. + +PAPR interface currently supports different ways of communicating these resource +grouping details to the OS. These are referred to as Form 0, Form 1 and Form2 +associativity grouping. Form 0 is the older format and is now considered deprecated. + +Hypervisor indicates the type/form of associativity used via "ibm,architecture-vec-5 property". +Bit 0 of byte 5 in the "ibm,architecture-vec-5" property indicates usage of Form 0 or Form 1. +A value of 1 indicates the usage of Form 1 associativity. For Form 2 associativity +bit 2 of byte 5 in the "ibm,architecture-vec-5" property is used. + +Form 0 +----- +Form 0 associativity supports only two NUMA distances (LOCAL and REMOTE). + +Form 1 +----- +With Form 1 a combination of ibm,associativity-reference-points, and ibm,associativity +device tree properties are used to determine the NUMA distance between resource groups/domains. + +The “ibm,associativity” property contains a list of one or more numbers (domainID) +representing the resource’s platform grouping domains. 
+ +The “ibm,associativity-reference-points” property contains a list of one or more numbers +(domainID index) that represents the 1 based ordinal in the associativity lists. +The list of domainID indexes represents an increasing hierarchy of resource grouping. + +ex: +{ primary domainID index, secondary domainID index, tertiary domainID index.. } + +Linux kernel uses the domainID at the primary domainID index as the NUMA node id. +Linux kernel computes NUMA distance between two domains by recursively comparing +if they belong to the same higher-level domains. For mismatch at every higher +level of the resource group, the kernel doubles the NUMA distance between the +comparing domains. + +Form 2 +------- +Form 2 associativity format adds separate device tree properties representing NUMA node distance +thereby making the node distance computation flexible. Form 2 also allows flexible primary +domain numbering. With numa distance computation now detached from the index value in +"ibm,associativity-reference-points" property, Form 2 allows a large number of primary domain +ids at the same domainID index representing resource groups of different performance/latency +characteristics. + +Hypervisor indicates the usage of FORM2 associativity using bit 2 of byte 5 in the +"ibm,architecture-vec-5" property. + +"ibm,numa-lookup-index-table" property contains a list of one or more numbers representing +the domainIDs present in the system. The offset of the domainID in this property is +used as an index while computing numa distance information via "ibm,numa-distance-table". + +prop-encoded-array: The number N of the domainIDs encoded as with encode-int, followed by +N domainID encoded as with encode-int + +For ex: +"ibm,numa-lookup-index-table" = {4, 0, 8, 250, 252}. The offset of domainID 8 (2) is used when +computing the distance of domain 8 from other domains present in the system. For the rest of +this document, this offset will be referred to as domain distance offset. 
+ +"ibm,numa-distance-table" property contains a list of one or more numbers representing the NUMA +distance between resource groups/domains present in the system. + +prop-encoded-array: The number N of the distance values encoded as with encode-int, followed by +N distance values encoded as with encode-bytes. The max distance value we could encode is 255. +The number N must be equal to the square of m where m is the number of domainIDs in the +numa-lookup-index-table. + +For ex: +ibm,numa-lookup-index-table = {3, 0, 8, 40} +ibm,numa-distance-table = {9, 10, 20, 80, 20, 10, 160, 80, 160, 10} + + | 0 8 40 +--|------------ + | +0 | 10 20 80 + | +8 | 20 10 160 + | +40| 80 160 10 + +A possible "ibm,associativity" property for resources in node 0, 8 and 40 + +{ 3, 6, 7, 0 } +{ 3, 6, 9, 8 } +{ 3, 6, 7, 40} + +With "ibm,associativity-reference-points" { 0x3 } + +"ibm,lookup-index-table" helps in having a compact representation of distance matrix. +Since domainID can be sparse, the matrix of distances can also be effectively sparse. +With "ibm,lookup-index-table" we can achieve a compact representation of +distance information. 
diff --git a/arch/powerpc/include/asm/firmware.h b/arch/powerpc/include/asm/firmware.h
index 60b631161360..97a3bd9ffeb9 100644
--- a/arch/powerpc/include/asm/firmware.h
+++ b/arch/powerpc/include/asm/firmware.h
@@ -53,6 +53,7 @@
 #define FW_FEATURE_ULTRAVISOR	ASM_CONST(0x0000004000000000)
 #define FW_FEATURE_STUFF_TCE	ASM_CONST(0x0000008000000000)
 #define FW_FEATURE_RPT_INVALIDATE ASM_CONST(0x0000010000000000)
+#define FW_FEATURE_FORM2_AFFINITY ASM_CONST(0x0000020000000000)
 
 #ifndef __ASSEMBLY__
 
@@ -73,7 +74,7 @@ enum {
 		FW_FEATURE_HPT_RESIZE | FW_FEATURE_DRMEM_V2 |
 		FW_FEATURE_DRC_INFO | FW_FEATURE_BLOCK_REMOVE |
 		FW_FEATURE_PAPR_SCM | FW_FEATURE_ULTRAVISOR |
-		FW_FEATURE_RPT_INVALIDATE,
+		FW_FEATURE_RPT_INVALIDATE | FW_FEATURE_FORM2_AFFINITY,
 	FW_FEATURE_PSERIES_ALWAYS = 0,
 	FW_FEATURE_POWERNV_POSSIBLE = FW_FEATURE_OPAL | FW_FEATURE_ULTRAVISOR,
 	FW_FEATURE_POWERNV_ALWAYS = 0,
diff --git a/arch/powerpc/include/asm/prom.h b/arch/powerpc/include/asm/prom.h
index df9fec9d232c..5c80152e8f18 100644
--- a/arch/powerpc/include/asm/prom.h
+++ b/arch/powerpc/include/asm/prom.h
@@ -149,6 +149,7 @@ extern int of_read_drc_info_cell(struct property **prop,
 #define OV5_XCMO		0x0440	/* Page Coalescing */
 #define OV5_FORM1_AFFINITY	0x0580	/* FORM1 NUMA affinity */
 #define OV5_PRRN		0x0540	/* Platform Resource Reassignment */
+#define OV5_FORM2_AFFINITY	0x0520	/* Form2 NUMA affinity */
 #define OV5_HP_EVT		0x0604	/* Hot Plug Event support */
 #define OV5_RESIZE_HPT		0x0601	/* Hash Page Table resizing */
 #define OV5_PFO_HW_RNG		0x1180	/* PFO Random Number Generator */
diff --git a/arch/powerpc/kernel/prom_init.c b/arch/powerpc/kernel/prom_init.c
index 5d9ea059594f..c483df6c9393 100644
--- a/arch/powerpc/kernel/prom_init.c
+++ b/arch/powerpc/kernel/prom_init.c
@@ -1069,7 +1069,8 @@ static const struct ibm_arch_vec ibm_architecture_vec_template __initconst = {
 #else
 		0,
 #endif
-		.associativity = OV5_FEAT(OV5_FORM1_AFFINITY) | OV5_FEAT(OV5_PRRN),
+		.associativity = OV5_FEAT(OV5_FORM1_AFFINITY) | OV5_FEAT(OV5_PRRN) |
+		OV5_FEAT(OV5_FORM2_AFFINITY),
 		.bin_opts = OV5_FEAT(OV5_RESIZE_HPT) | OV5_FEAT(OV5_HP_EVT),
 		.micro_checkpoint = 0,
 		.reserved0 = 0,
diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index c6293037a103..c68846fc9550 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -56,12 +56,17 @@ static int n_mem_addr_cells, n_mem_size_cells;
 
 #define FORM0_AFFINITY 0
 #define FORM1_AFFINITY 1
+#define FORM2_AFFINITY 2
 static int affinity_form;
 
 #define MAX_DISTANCE_REF_POINTS 4
 static int max_associativity_domain_index;
 static const __be32 *distance_ref_points;
 static int distance_lookup_table[MAX_NUMNODES][MAX_DISTANCE_REF_POINTS];
+static int numa_distance_table[MAX_NUMNODES][MAX_NUMNODES] = {
+	[0 ... MAX_NUMNODES - 1] = { [0 ... MAX_NUMNODES - 1] = -1 }
+};
+static int numa_id_index_table[MAX_NUMNODES] = { [0 ... MAX_NUMNODES - 1] = NUMA_NO_NODE };
 
 /*
  * Allocate node_to_cpumask_map based on number of available nodes
@@ -166,6 +171,44 @@ static void unmap_cpu_from_node(unsigned long cpu)
 }
 #endif /* CONFIG_HOTPLUG_CPU || CONFIG_PPC_SPLPAR */
 
+/*
+ * Returns nid in the range [0..nr_node_ids], or -1 if no useful NUMA
+ * info is found.
+ */
+static int associativity_to_nid(const __be32 *associativity)
+{
+	int nid = NUMA_NO_NODE;
+
+	if (!numa_enabled)
+		goto out;
+
+	if (of_read_number(associativity, 1) >= primary_domain_index)
+		nid = of_read_number(&associativity[primary_domain_index], 1);
+
+	/* POWER4 LPAR uses 0xffff as invalid node */
+	if (nid == 0xffff || nid >= nr_node_ids)
+		nid = NUMA_NO_NODE;
+out:
+	return nid;
+}
+
+static int __cpu_form2_relative_distance(__be32 *cpu1_assoc, __be32 *cpu2_assoc)
+{
+	int dist;
+	int node1, node2;
+
+	node1 = associativity_to_nid(cpu1_assoc);
+	node2 = associativity_to_nid(cpu2_assoc);
+
+	dist = numa_distance_table[node1][node2];
+	if (dist <= LOCAL_DISTANCE)
+		return 0;
+	else if (dist <= REMOTE_DISTANCE)
+		return 1;
+	else
+		return 2;
+}
+
 static int __cpu_form1_relative_distance(__be32 *cpu1_assoc, __be32 *cpu2_assoc)
 {
 	int dist = 0;
@@ -186,8 +229,9 @@ int cpu_relative_distance(__be32 *cpu1_assoc, __be32 *cpu2_assoc)
 {
 	/* We should not get called with FORM0 */
 	VM_WARN_ON(affinity_form == FORM0_AFFINITY);
-
-	return __cpu_form1_relative_distance(cpu1_assoc, cpu2_assoc);
+	if (affinity_form == FORM1_AFFINITY)
+		return __cpu_form1_relative_distance(cpu1_assoc, cpu2_assoc);
+	return __cpu_form2_relative_distance(cpu1_assoc, cpu2_assoc);
 }
 
 /* must hold reference to node during call */
@@ -201,7 +245,9 @@ int __node_distance(int a, int b)
 	int i;
 	int distance = LOCAL_DISTANCE;
 
-	if (affinity_form == FORM0_AFFINITY)
+	if (affinity_form == FORM2_AFFINITY)
+		return numa_distance_table[a][b];
+	else if (affinity_form == FORM0_AFFINITY)
 		return ((a == b) ? LOCAL_DISTANCE : REMOTE_DISTANCE);
 
 	for (i = 0; i < max_associativity_domain_index; i++) {
@@ -216,27 +262,6 @@ int __node_distance(int a, int b)
 }
 EXPORT_SYMBOL(__node_distance);
 
-/*
- * Returns nid in the range [0..nr_node_ids], or -1 if no useful NUMA
- * info is found.
- */
-static int associativity_to_nid(const __be32 *associativity)
-{
-	int nid = NUMA_NO_NODE;
-
-	if (!numa_enabled)
-		goto out;
-
-	if (of_read_number(associativity, 1) >= primary_domain_index)
-		nid = of_read_number(&associativity[primary_domain_index], 1);
-
-	/* POWER4 LPAR uses 0xffff as invalid node */
-	if (nid == 0xffff || nid >= nr_node_ids)
-		nid = NUMA_NO_NODE;
-out:
-	return nid;
-}
-
 /* Returns the nid associated with the given device tree node,
  * or -1 if not found.
  */
@@ -305,12 +330,84 @@ static void initialize_form1_numa_distance(struct device_node *node)
  */
 void update_numa_distance(struct device_node *node)
 {
+	int nid;
+
 	if (affinity_form == FORM0_AFFINITY)
 		return;
 	else if (affinity_form == FORM1_AFFINITY) {
 		initialize_form1_numa_distance(node);
 		return;
 	}
+
+	/* FORM2 affinity */
+	nid = of_node_to_nid_single(node);
+	if (nid == NUMA_NO_NODE)
+		return;
+
+	/*
+	 * With FORM2 we expect NUMA distance of all possible NUMA
+	 * nodes to be provided during boot.
+	 */
+	WARN(numa_distance_table[nid][nid] == -1,
+	     "NUMA distance details for node %d not provided\n", nid);
+}
+
+/*
+ * ibm,numa-lookup-index-table= {N, domainid1, domainid2, ..... domainidN}
+ * ibm,numa-distance-table = { N, 1, 2, 4, 5, 1, 6, .... N elements}
+ */
+static void initialize_form2_numa_distance_lookup_table(struct device_node *root)
+{
+	int i, j;
+	const __u8 *numa_dist_table;
+	const __be32 *numa_lookup_index;
+	int numa_dist_table_length;
+	int max_numa_index, distance_index;
+
+	numa_lookup_index = of_get_property(root, "ibm,numa-lookup-index-table", NULL);
+	max_numa_index = of_read_number(&numa_lookup_index[0], 1);
+
+	/* first element of the array is the size and is encode-int */
+	numa_dist_table = of_get_property(root, "ibm,numa-distance-table", NULL);
+	numa_dist_table_length = of_read_number((const __be32 *)&numa_dist_table[0], 1);
+	/* Skip the size which is encoded int */
+	numa_dist_table += sizeof(__be32);
+
+	pr_debug("numa_dist_table_len = %d, numa_dist_indexes_len = %d\n",
+		 numa_dist_table_length, max_numa_index);
+
+	for (i = 0; i < max_numa_index; i++)
+		/* +1 skip the max_numa_index in the property */
+		numa_id_index_table[i] = of_read_number(&numa_lookup_index[i + 1], 1);
+
+
+	if (numa_dist_table_length != max_numa_index * max_numa_index) {
+
+		WARN(1, "Wrong NUMA distance information\n");
+		/* consider everybody else just remote. */
+		for (i = 0; i < max_numa_index; i++) {
+			for (j = 0; j < max_numa_index; j++) {
+				int nodeA = numa_id_index_table[i];
+				int nodeB = numa_id_index_table[j];
+
+				if (nodeA == nodeB)
+					numa_distance_table[nodeA][nodeB] = LOCAL_DISTANCE;
+				else
+					numa_distance_table[nodeA][nodeB] = REMOTE_DISTANCE;
+			}
+		}
+	}
+
+	distance_index = 0;
+	for (i = 0; i < max_numa_index; i++) {
+		for (j = 0; j < max_numa_index; j++) {
+			int nodeA = numa_id_index_table[i];
+			int nodeB = numa_id_index_table[j];
+
+			numa_distance_table[nodeA][nodeB] = numa_dist_table[distance_index++];
+			pr_debug("dist[%d][%d]=%d ", nodeA, nodeB, numa_distance_table[nodeA][nodeB]);
+		}
+	}
 }
 
 static int __init find_primary_domain_index(void)
@@ -323,6 +420,9 @@ static int __init find_primary_domain_index(void)
 	 */
 	if (firmware_has_feature(FW_FEATURE_OPAL)) {
 		affinity_form = FORM1_AFFINITY;
+	} else if (firmware_has_feature(FW_FEATURE_FORM2_AFFINITY)) {
+		dbg("Using form 2 affinity\n");
+		affinity_form = FORM2_AFFINITY;
 	} else if (firmware_has_feature(FW_FEATURE_FORM1_AFFINITY)) {
 		dbg("Using form 1 affinity\n");
 		affinity_form = FORM1_AFFINITY;
@@ -367,8 +467,17 @@ static int __init find_primary_domain_index(void)
 
 		index = of_read_number(&distance_ref_points[1], 1);
 	} else {
+		/*
+		 * Both FORM1 and FORM2 affinity find the primary domain details
+		 * at the same offset.
+		 */
 		index = of_read_number(distance_ref_points, 1);
 	}
+	/*
+	 * If it is FORM2 also initialize the distance table here.
+	 */
+	if (affinity_form == FORM2_AFFINITY)
+		initialize_form2_numa_distance_lookup_table(root);
 
 	/*
 	 * Warn and cap if the hardware supports more than
diff --git a/arch/powerpc/platforms/pseries/firmware.c b/arch/powerpc/platforms/pseries/firmware.c
index 5d4c2bc20bba..f162156b7b68 100644
--- a/arch/powerpc/platforms/pseries/firmware.c
+++ b/arch/powerpc/platforms/pseries/firmware.c
@@ -123,6 +123,7 @@ vec5_fw_features_table[] = {
 	{FW_FEATURE_PRRN,		OV5_PRRN},
 	{FW_FEATURE_DRMEM_V2,		OV5_DRMEM_V2},
 	{FW_FEATURE_DRC_INFO,		OV5_DRC_INFO},
+	{FW_FEATURE_FORM2_AFFINITY,	OV5_FORM2_AFFINITY},
 };
 
 static void __init fw_vec5_feature_init(const char *vec5, unsigned long len)
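
For readers who want to check the property layout by hand, here is a minimal user-space sketch of how the two FORM2 properties described in associativity.rst combine into a distance matrix indexed by domain ID. It is not the kernel code from the patch. Every name prefixed with sketch_ and the SKETCH_MAX_NODES cap are hypothetical, and the inputs are assumed to be already converted from device-tree encoding to host integers. LOCAL_DISTANCE and REMOTE_DISTANCE use the kernel's values of 10 and 20. One deliberate difference: when N is not the square of m, the sketch returns right after filling in the local/remote fallback, whereas the numa.c hunk in this patch goes on into the copy loop after the warning.

```c
#include <stddef.h>
#include <stdint.h>

#define LOCAL_DISTANCE   10	/* same values the kernel uses */
#define REMOTE_DISTANCE  20
#define SKETCH_MAX_NODES 64	/* hypothetical cap, only for this sketch */

/* -1 marks "no distance provided", like the initial value in the patch. */
int sketch_distance[SKETCH_MAX_NODES][SKETCH_MAX_NODES];

/*
 * lookup:   { m, domainID_0, ..., domainID_(m-1) } as host integers
 * dist_len: the leading N of ibm,numa-distance-table
 * dist:     the N distance bytes that follow N
 *
 * Returns 0 on success, -1 if a domain ID does not fit the sketch table
 * or if N != m * m (in which case the local/remote fallback is filled in).
 */
int sketch_fill_distances(const uint32_t *lookup, uint32_t dist_len,
			  const uint8_t *dist)
{
	uint32_t m = lookup[0];
	uint32_t i, j, k = 0;

	for (i = 0; i < SKETCH_MAX_NODES; i++)
		for (j = 0; j < SKETCH_MAX_NODES; j++)
			sketch_distance[i][j] = -1;

	for (i = 0; i < m; i++)
		if (lookup[i + 1] >= SKETCH_MAX_NODES)
			return -1;

	if (dist_len != m * m) {
		/* Inconsistent tables: treat every other node as remote, then stop. */
		for (i = 0; i < m; i++)
			for (j = 0; j < m; j++)
				sketch_distance[lookup[i + 1]][lookup[j + 1]] =
					(lookup[i + 1] == lookup[j + 1]) ?
					LOCAL_DISTANCE : REMOTE_DISTANCE;
		return -1;
	}

	/* Row-major m x m matrix; rows and columns follow the lookup table order. */
	for (i = 0; i < m; i++)
		for (j = 0; j < m; j++)
			sketch_distance[lookup[i + 1]][lookup[j + 1]] = dist[k++];
	return 0;
}

/* Decode the example from associativity.rst and look up one pair of domain IDs. */
int sketch_example_distance(int a, int b)
{
	static const uint32_t lookup[] = { 3, 0, 8, 40 };
	static const uint8_t dist[] = { 10, 20, 80, 20, 10, 160, 80, 160, 10 };

	sketch_fill_distances(lookup, 9, dist);
	return sketch_distance[a][b];
}
```

With the example tables, sketch_example_distance(8, 40) yields 160, matching the matrix in the documentation, and any pair of IDs that does not appear in the lookup table stays at -1.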