
[v5,6/6] powerpc/pseries: Add support for FORM2 associativity

Message ID 20210628151117.545935-7-aneesh.kumar@linux.ibm.com
State Superseded
Series Add support for FORM2 associativity

Checks

Context Check Description
snowpatch_ozlabs/apply_patch success Successfully applied on branch powerpc/merge (0f7a719601eb957c10d417c62bd5f65080b5a409)
snowpatch_ozlabs/build-ppc64le warning Build succeeded but added 1 new sparse warnings
snowpatch_ozlabs/build-ppc64be warning Build succeeded but added 1 new sparse warnings
snowpatch_ozlabs/build-ppc64e success Build succeeded
snowpatch_ozlabs/build-pmac32 success Build succeeded
snowpatch_ozlabs/checkpatch warning total: 0 errors, 3 warnings, 3 checks, 360 lines checked
snowpatch_ozlabs/needsstable success Patch has no Fixes tags

Commit Message

Aneesh Kumar K V June 28, 2021, 3:11 p.m. UTC
The PAPR interface currently supports two ways of communicating resource
grouping details to the OS. These are referred to as Form 0 and Form 1
associativity grouping. Form 0 is the older format and is now considered
deprecated. This patch adds a third resource grouping format, named FORM2.

Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
---
 Documentation/powerpc/associativity.rst   | 103 ++++++++++++++
 arch/powerpc/include/asm/firmware.h       |   3 +-
 arch/powerpc/include/asm/prom.h           |   1 +
 arch/powerpc/kernel/prom_init.c           |   3 +-
 arch/powerpc/mm/numa.c                    | 157 ++++++++++++++++++----
 arch/powerpc/platforms/pseries/firmware.c |   1 +
 6 files changed, 242 insertions(+), 26 deletions(-)
 create mode 100644 Documentation/powerpc/associativity.rst

Comments

David Gibson July 22, 2021, 2:28 a.m. UTC | #1
On Mon, Jun 28, 2021 at 08:41:17PM +0530, Aneesh Kumar K.V wrote:
> PAPR interface currently supports two different ways of communicating resource
> grouping details to the OS. These are referred to as Form 0 and Form 1
> associativity grouping. Form 0 is the older format and is now considered
> deprecated. This patch adds another resource grouping named FORM2.
> 
> Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>
> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
> ---
>  Documentation/powerpc/associativity.rst   | 103 ++++++++++++++
>  arch/powerpc/include/asm/firmware.h       |   3 +-
>  arch/powerpc/include/asm/prom.h           |   1 +
>  arch/powerpc/kernel/prom_init.c           |   3 +-
>  arch/powerpc/mm/numa.c                    | 157 ++++++++++++++++++----
>  arch/powerpc/platforms/pseries/firmware.c |   1 +
>  6 files changed, 242 insertions(+), 26 deletions(-)
>  create mode 100644 Documentation/powerpc/associativity.rst
> 
> diff --git a/Documentation/powerpc/associativity.rst b/Documentation/powerpc/associativity.rst
> new file mode 100644
> index 000000000000..31cc7da2c7a6
> --- /dev/null
> +++ b/Documentation/powerpc/associativity.rst
> @@ -0,0 +1,103 @@
> +============================
> +NUMA resource associativity
> +=============================
> +
> +Associativity represents the groupings of the various platform resources into
> +domains of substantially similar mean performance relative to resources outside
> +of that domain. Resources subsets of a given domain that exhibit better
> +performance relative to each other than relative to other resources subsets
> +are represented as being members of a sub-grouping domain. This performance
> +characteristic is presented in terms of NUMA node distance within the Linux kernel.
> +From the platform view, these groups are also referred to as domains.

Pretty hard to decipher, but that's typical for PAPR.

> +PAPR interface currently supports different ways of communicating these resource
> +grouping details to the OS. These are referred to as Form 0, Form 1 and Form2
> +associativity grouping. Form 0 is the older format and is now considered deprecated.

Nit: s/older/oldest/ since there are now >2 forms.

> +Hypervisor indicates the type/form of associativity used via "ibm,architecture-vec-5 property".
> +Bit 0 of byte 5 in the "ibm,architecture-vec-5" property indicates usage of Form 0 or Form 1.
> +A value of 1 indicates the usage of Form 1 associativity. For Form 2 associativity
> +bit 2 of byte 5 in the "ibm,architecture-vec-5" property is used.
> +
> +Form 0
> +-----
> +Form 0 associativity supports only two NUMA distances (LOCAL and REMOTE).
> +
> +Form 1
> +-----
> +With Form 1 a combination of ibm,associativity-reference-points, and ibm,associativity
> +device tree properties are used to determine the NUMA distance between resource groups/domains.
> +
> +The “ibm,associativity” property contains a list of one or more numbers (domainID)
> +representing the resource’s platform grouping domains.
> +
> +The “ibm,associativity-reference-points” property contains a list of one or more numbers
> +(domainID index) that represents the 1 based ordinal in the associativity lists.
> +The list of domainID indexes represents an increasing hierarchy of resource grouping.
> +
> +ex:
> +{ primary domainID index, secondary domainID index, tertiary domainID index.. }
> +
> +Linux kernel uses the domainID at the primary domainID index as the NUMA node id.
> +Linux kernel computes NUMA distance between two domains by recursively comparing
> +if they belong to the same higher-level domains. For mismatch at every higher
> +level of the resource group, the kernel doubles the NUMA distance between the
> +comparing domains.
> +
> +Form 2
> +-------
> +Form 2 associativity format adds separate device tree properties representing NUMA node distance
> +thereby making the node distance computation flexible. Form 2 also allows flexible primary
> +domain numbering. With numa distance computation now detached from the index value in
> +"ibm,associativity-reference-points" property, Form 2 allows a large number of primary domain
> +ids at the same domainID index representing resource groups of different performance/latency
> +characteristics.
> +
> +Hypervisor indicates the usage of FORM2 associativity using bit 2 of byte 5 in the
> +"ibm,architecture-vec-5" property.
> +
> +"ibm,numa-lookup-index-table" property contains a list of one or more numbers representing
> +the domainIDs present in the system. The offset of the domainID in this property is
> +used as an index while computing numa distance information via "ibm,numa-distance-table".
> +
> +prop-encoded-array: The number N of the domainIDs encoded as with encode-int, followed by
> +N domainID encoded as with encode-int
> +
> +For ex:
> +"ibm,numa-lookup-index-table" =  {4, 0, 8, 250, 252}. The offset of domainID 8 (2) is used when
> +computing the distance of domain 8 from other domains present in the system. For the rest of
> +this document, this offset will be referred to as domain distance offset.
> +
> +"ibm,numa-distance-table" property contains a list of one or more numbers representing the NUMA
> +distance between resource groups/domains present in the system.
> +
> +prop-encoded-array: The number N of the distance values encoded as with encode-int, followed by
> +N distance values encoded as with encode-bytes. The max distance value we could encode is 255.
> +The number N must be equal to the square of m where m is the number of domainIDs in the
> +numa-lookup-index-table.
> +
> +For ex:
> +ibm,numa-lookup-index-table =  {3, 0, 8, 40}
> +ibm,numa-distance-table     =  {9, 10, 20, 80, 20, 10, 160, 80, 160, 10}

This representation doesn't make it clear that the 9 is a u32, but the
rest are u8s.
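
To make the mixed widths concrete, here is a minimal userspace sketch (not
the patch's code) of how the two example properties could be decoded. It
assumes the layout described in the documentation text: every cell of
"ibm,numa-lookup-index-table" is a big-endian u32, while
"ibm,numa-distance-table" is a big-endian u32 count followed by one byte per
distance, in row-major order. The helper names read_be32() and
form2_distance() are made up for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Read one big-endian 32-bit cell (the encode-int layout). */
static uint32_t read_be32(const uint8_t *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/*
 * Hypothetical helper: return the distance between domainIDs a and b,
 * or -1 if either property is malformed or a domainID is not listed.
 *
 * lookup: u32 count m, then m u32 domainIDs
 * dist:   u32 count N (must equal m * m), then N u8 distances, row-major
 */
int form2_distance(const uint8_t *lookup, size_t lookup_len,
		   const uint8_t *dist, size_t dist_len,
		   uint32_t a, uint32_t b)
{
	uint32_t m, n, i, row = 0, col = 0;
	int have_row = 0, have_col = 0;

	if (lookup_len < 4 || dist_len < 4)
		return -1;

	m = read_be32(lookup);
	n = read_be32(dist);

	/* Every domainID is a 4-byte cell; every distance is a single byte. */
	if ((lookup_len - 4) / 4 < m || dist_len - 4 < n)
		return -1;
	if ((uint64_t)m * m != n)
		return -1;

	/* The position of a domainID in the lookup table is its matrix index. */
	for (i = 0; i < m; i++) {
		uint32_t id = read_be32(lookup + 4 + 4 * (size_t)i);

		if (id == a) {
			row = i;
			have_row = 1;
		}
		if (id == b) {
			col = i;
			have_col = 1;
		}
	}
	if (!have_row || !have_col)
		return -1;

	return dist[4 + (size_t)row * m + col];
}
```

Written out as raw bytes, the example distance table is 13 bytes long: four
bytes for the count 9, then nine single-byte distances.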

> +
> +  | 0    8   40
> +--|------------
> +  |
> +0 | 10   20  80
> +  |
> +8 | 20   10  160
> +  |
> +40| 80   160  10
> +
> +A possible "ibm,associativity" property for resources in node 0, 8 and 40
> +
> +{ 3, 6, 7, 0 }
> +{ 3, 6, 9, 8 }
> +{ 3, 6, 7, 40}
> +
> +With "ibm,associativity-reference-points"  { 0x3 }

You haven't actually described how ibm,associativity-reference-points
operates in Form2.

> +"ibm,lookup-index-table" helps in having a compact representation of distance matrix.
> +Since domainID can be sparse, the matrix of distances can also be effectively sparse.
> +With "ibm,lookup-index-table" we can achieve a compact representation of
> +distance information.
> diff --git a/arch/powerpc/include/asm/firmware.h b/arch/powerpc/include/asm/firmware.h
> index 60b631161360..97a3bd9ffeb9 100644
> --- a/arch/powerpc/include/asm/firmware.h
> +++ b/arch/powerpc/include/asm/firmware.h
> @@ -53,6 +53,7 @@
>  #define FW_FEATURE_ULTRAVISOR	ASM_CONST(0x0000004000000000)
>  #define FW_FEATURE_STUFF_TCE	ASM_CONST(0x0000008000000000)
>  #define FW_FEATURE_RPT_INVALIDATE ASM_CONST(0x0000010000000000)
> +#define FW_FEATURE_FORM2_AFFINITY ASM_CONST(0x0000020000000000)
>  
>  #ifndef __ASSEMBLY__
>  
> @@ -73,7 +74,7 @@ enum {
>  		FW_FEATURE_HPT_RESIZE | FW_FEATURE_DRMEM_V2 |
>  		FW_FEATURE_DRC_INFO | FW_FEATURE_BLOCK_REMOVE |
>  		FW_FEATURE_PAPR_SCM | FW_FEATURE_ULTRAVISOR |
> -		FW_FEATURE_RPT_INVALIDATE,
> +		FW_FEATURE_RPT_INVALIDATE | FW_FEATURE_FORM2_AFFINITY,
>  	FW_FEATURE_PSERIES_ALWAYS = 0,
>  	FW_FEATURE_POWERNV_POSSIBLE = FW_FEATURE_OPAL | FW_FEATURE_ULTRAVISOR,
>  	FW_FEATURE_POWERNV_ALWAYS = 0,
> diff --git a/arch/powerpc/include/asm/prom.h b/arch/powerpc/include/asm/prom.h
> index df9fec9d232c..5c80152e8f18 100644
> --- a/arch/powerpc/include/asm/prom.h
> +++ b/arch/powerpc/include/asm/prom.h
> @@ -149,6 +149,7 @@ extern int of_read_drc_info_cell(struct property **prop,
>  #define OV5_XCMO		0x0440	/* Page Coalescing */
>  #define OV5_FORM1_AFFINITY	0x0580	/* FORM1 NUMA affinity */
>  #define OV5_PRRN		0x0540	/* Platform Resource Reassignment */
> +#define OV5_FORM2_AFFINITY	0x0520	/* Form2 NUMA affinity */
>  #define OV5_HP_EVT		0x0604	/* Hot Plug Event support */
>  #define OV5_RESIZE_HPT		0x0601	/* Hash Page Table resizing */
>  #define OV5_PFO_HW_RNG		0x1180	/* PFO Random Number Generator */
> diff --git a/arch/powerpc/kernel/prom_init.c b/arch/powerpc/kernel/prom_init.c
> index 5d9ea059594f..c483df6c9393 100644
> --- a/arch/powerpc/kernel/prom_init.c
> +++ b/arch/powerpc/kernel/prom_init.c
> @@ -1069,7 +1069,8 @@ static const struct ibm_arch_vec ibm_architecture_vec_template __initconst = {
>  #else
>  		0,
>  #endif
> -		.associativity = OV5_FEAT(OV5_FORM1_AFFINITY) | OV5_FEAT(OV5_PRRN),
> +		.associativity = OV5_FEAT(OV5_FORM1_AFFINITY) | OV5_FEAT(OV5_PRRN) |
> +		OV5_FEAT(OV5_FORM2_AFFINITY),
>  		.bin_opts = OV5_FEAT(OV5_RESIZE_HPT) | OV5_FEAT(OV5_HP_EVT),
>  		.micro_checkpoint = 0,
>  		.reserved0 = 0,
> diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
> index c6293037a103..c68846fc9550 100644
> --- a/arch/powerpc/mm/numa.c
> +++ b/arch/powerpc/mm/numa.c
> @@ -56,12 +56,17 @@ static int n_mem_addr_cells, n_mem_size_cells;
>  
>  #define FORM0_AFFINITY 0
>  #define FORM1_AFFINITY 1
> +#define FORM2_AFFINITY 2
>  static int affinity_form;
>  
>  #define MAX_DISTANCE_REF_POINTS 4
>  static int max_associativity_domain_index;
>  static const __be32 *distance_ref_points;
>  static int distance_lookup_table[MAX_NUMNODES][MAX_DISTANCE_REF_POINTS];
> +static int numa_distance_table[MAX_NUMNODES][MAX_NUMNODES] = {
> +	[0 ... MAX_NUMNODES - 1] = { [0 ... MAX_NUMNODES - 1] = -1 }
> +};
> +static int numa_id_index_table[MAX_NUMNODES] = { [0 ... MAX_NUMNODES - 1] = NUMA_NO_NODE };
>  
>  /*
>   * Allocate node_to_cpumask_map based on number of available nodes
> @@ -166,6 +171,44 @@ static void unmap_cpu_from_node(unsigned long cpu)
>  }
>  #endif /* CONFIG_HOTPLUG_CPU || CONFIG_PPC_SPLPAR */
>  
> +/*
> + * Returns nid in the range [0..nr_node_ids], or -1 if no useful NUMA
> + * info is found.
> + */
> +static int associativity_to_nid(const __be32 *associativity)
> +{
> +	int nid = NUMA_NO_NODE;
> +
> +	if (!numa_enabled)
> +		goto out;
> +
> +	if (of_read_number(associativity, 1) >= primary_domain_index)
> +		nid = of_read_number(&associativity[primary_domain_index], 1);
> +
> +	/* POWER4 LPAR uses 0xffff as invalid node */
> +	if (nid == 0xffff || nid >= nr_node_ids)
> +		nid = NUMA_NO_NODE;
> +out:
> +	return nid;
> +}
> +
> +static int __cpu_form2_relative_distance(__be32 *cpu1_assoc, __be32 *cpu2_assoc)
> +{
> +	int dist;
> +	int node1, node2;
> +
> +	node1 = associativity_to_nid(cpu1_assoc);
> +	node2 = associativity_to_nid(cpu2_assoc);
> +
> +	dist = numa_distance_table[node1][node2];
> +	if (dist <= LOCAL_DISTANCE)
> +		return 0;
> +	else if (dist <= REMOTE_DISTANCE)
> +		return 1;
> +	else
> +		return 2;

Squashing the full range of distances into just 0, 1 or 2 seems odd.
But then, this whole cpu_distance() thing being distinct from
node_distance() seems odd.
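
For reference, the bucketing under discussion can be reproduced in isolation.
This is a standalone sketch rather than the patch itself: form2_bucket() is a
hypothetical name, and LOCAL_DISTANCE and REMOTE_DISTANCE are redefined here
with the values from include/linux/topology.h (10 and 20).

```c
#define LOCAL_DISTANCE	10
#define REMOTE_DISTANCE	20

/* Same three-way split as __cpu_form2_relative_distance() in the patch. */
int form2_bucket(int dist)
{
	if (dist <= LOCAL_DISTANCE)
		return 0;
	else if (dist <= REMOTE_DISTANCE)
		return 1;
	else
		return 2;
}
```

With the example table from the documentation, distances 80 and 160 land in
the same bucket, and an entry still holding the initial value -1 lands in
bucket 0.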

> +}
> +
>  static int __cpu_form1_relative_distance(__be32 *cpu1_assoc, __be32 *cpu2_assoc)
>  {
>  	int dist = 0;
> @@ -186,8 +229,9 @@ int cpu_relative_distance(__be32 *cpu1_assoc, __be32 *cpu2_assoc)
>  {
>  	/* We should not get called with FORM0 */
>  	VM_WARN_ON(affinity_form == FORM0_AFFINITY);
> -
> -	return __cpu_form1_relative_distance(cpu1_assoc, cpu2_assoc);
> +	if (affinity_form == FORM1_AFFINITY)
> +		return __cpu_form1_relative_distance(cpu1_assoc, cpu2_assoc);
> +	return __cpu_form2_relative_distance(cpu1_assoc, cpu2_assoc);
>  }
>  
>  /* must hold reference to node during call */
> @@ -201,7 +245,9 @@ int __node_distance(int a, int b)
>  	int i;
>  	int distance = LOCAL_DISTANCE;
>  
> -	if (affinity_form == FORM0_AFFINITY)
> +	if (affinity_form == FORM2_AFFINITY)
> +		return numa_distance_table[a][b];
> +	else if (affinity_form == FORM0_AFFINITY)
>  		return ((a == b) ? LOCAL_DISTANCE : REMOTE_DISTANCE);
>  
>  	for (i = 0; i < max_associativity_domain_index; i++) {

Hmm.. couldn't we simplify this whole __node_distance function, if we
just update numa_distance_table[][] appropriately for Form0 and Form1
as well?
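
A rough sketch of what that could look like, assuming the Form 0 and Form 1
rules stay exactly as they are in the current __node_distance(): the table is
filled once, and the lookup becomes a plain array access for all three forms.
fill_form0(), fill_form1(), node_distance(), dist_table, MAX_NODES and
MAX_LEVELS are illustrative names and sizes, not kernel symbols.

```c
#define LOCAL_DISTANCE	10	/* values from include/linux/topology.h */
#define REMOTE_DISTANCE	20
#define MAX_NODES	4	/* illustrative; the kernel uses MAX_NUMNODES */
#define MAX_LEVELS	4	/* plays the role of MAX_DISTANCE_REF_POINTS */

int dist_table[MAX_NODES][MAX_NODES];

/* Form 0: a node is LOCAL to itself and REMOTE to everything else. */
void fill_form0(int nr_nodes)
{
	int a, b;

	for (a = 0; a < nr_nodes; a++)
		for (b = 0; b < nr_nodes; b++)
			dist_table[a][b] = (a == b) ? LOCAL_DISTANCE : REMOTE_DISTANCE;
}

/*
 * Form 1: walk the reference-point levels and double the distance for
 * each level at which the two nodes sit in different domains, stopping
 * at the first level where they match.
 */
void fill_form1(int lookup[][MAX_LEVELS], int nr_nodes, int depth)
{
	int a, b, i;

	for (a = 0; a < nr_nodes; a++) {
		for (b = 0; b < nr_nodes; b++) {
			int distance = LOCAL_DISTANCE;

			for (i = 0; i < depth; i++) {
				if (lookup[a][i] == lookup[b][i])
					break;
				distance *= 2;
			}
			dist_table[a][b] = distance;
		}
	}
}

/* With the table populated, every form shares the same lookup. */
int node_distance(int a, int b)
{
	return dist_table[a][b];
}
```

The trade-off is memory: a full MAX_NUMNODES x MAX_NUMNODES table is kept
even when the firmware only reports Form 0 or Form 1.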

> @@ -216,27 +262,6 @@ int __node_distance(int a, int b)
>  }
>  EXPORT_SYMBOL(__node_distance);
>  
> -/*
> - * Returns nid in the range [0..nr_node_ids], or -1 if no useful NUMA
> - * info is found.
> - */
> -static int associativity_to_nid(const __be32 *associativity)
> -{
> -	int nid = NUMA_NO_NODE;
> -
> -	if (!numa_enabled)
> -		goto out;
> -
> -	if (of_read_number(associativity, 1) >= primary_domain_index)
> -		nid = of_read_number(&associativity[primary_domain_index], 1);
> -
> -	/* POWER4 LPAR uses 0xffff as invalid node */
> -	if (nid == 0xffff || nid >= nr_node_ids)
> -		nid = NUMA_NO_NODE;
> -out:
> -	return nid;
> -}
> -
>  /* Returns the nid associated with the given device tree node,
>   * or -1 if not found.
>   */
> @@ -305,12 +330,84 @@ static void initialize_form1_numa_distance(struct device_node *node)
>   */
>  void update_numa_distance(struct device_node *node)
>  {
> +	int nid;
> +
>  	if (affinity_form == FORM0_AFFINITY)
>  		return;
>  	else if (affinity_form == FORM1_AFFINITY) {
>  		initialize_form1_numa_distance(node);
>  		return;
>  	}
> +
> +	/* FORM2 affinity  */
> +	nid = of_node_to_nid_single(node);
> +	if (nid == NUMA_NO_NODE)
> +		return;
> +
> +	/*
> +	 * With FORM2 we expect NUMA distance of all possible NUMA
> +	 * nodes to be provided during boot.
> +	 */
> +	WARN(numa_distance_table[nid][nid] == -1,
> +	     "NUMA distance details for node %d not provided\n", nid);
> +}
> +
> +/*
> + * ibm,numa-lookup-index-table= {N, domainid1, domainid2, ..... domainidN}
> + * ibm,numa-distance-table = { N, 1, 2, 4, 5, 1, 6, .... N elements}
> + */
> +static void initialize_form2_numa_distance_lookup_table(struct device_node *root)
> +{
> +	int i, j;
> +	const __u8 *numa_dist_table;
> +	const __be32 *numa_lookup_index;
> +	int numa_dist_table_length;
> +	int max_numa_index, distance_index;
> +
> +	numa_lookup_index = of_get_property(root, "ibm,numa-lookup-index-table", NULL);
> +	max_numa_index = of_read_number(&numa_lookup_index[0], 1);
> +
> +	/* first element of the array is the size and is encode-int */
> +	numa_dist_table = of_get_property(root, "ibm,numa-distance-table", NULL);
> +	numa_dist_table_length = of_read_number((const __be32 *)&numa_dist_table[0], 1);
> +	/* Skip the size which is encoded int */
> +	numa_dist_table += sizeof(__be32);
> +
> +	pr_debug("numa_dist_table_len = %d, numa_dist_indexes_len = %d\n",
> +		 numa_dist_table_length, max_numa_index);
> +
> +	for (i = 0; i < max_numa_index; i++)
> +		/* +1 skip the max_numa_index in the property */
> +		numa_id_index_table[i] = of_read_number(&numa_lookup_index[i + 1], 1);
> +
> +
> +	if (numa_dist_table_length != max_numa_index * max_numa_index) {
> +
> +		WARN(1, "Wrong NUMA distance information\n");
> +		/* consider everybody else just remote. */
> +		for (i = 0;  i < max_numa_index; i++) {
> +			for (j = 0; j < max_numa_index; j++) {
> +				int nodeA = numa_id_index_table[i];
> +				int nodeB = numa_id_index_table[j];
> +
> +				if (nodeA == nodeB)
> +					numa_distance_table[nodeA][nodeB] = LOCAL_DISTANCE;
> +				else
> +					numa_distance_table[nodeA][nodeB] = REMOTE_DISTANCE;
> +			}
> +		}
> +	}
> +
> +	distance_index = 0;
> +	for (i = 0;  i < max_numa_index; i++) {
> +		for (j = 0; j < max_numa_index; j++) {
> +			int nodeA = numa_id_index_table[i];
> +			int nodeB = numa_id_index_table[j];
> +
> +			numa_distance_table[nodeA][nodeB] = numa_dist_table[distance_index++];
> +			pr_debug("dist[%d][%d]=%d ", nodeA, nodeB, numa_distance_table[nodeA][nodeB]);
> +		}
> +	}
>  }
>  
>  static int __init find_primary_domain_index(void)
> @@ -323,6 +420,9 @@ static int __init find_primary_domain_index(void)
>  	 */
>  	if (firmware_has_feature(FW_FEATURE_OPAL)) {
>  		affinity_form = FORM1_AFFINITY;
> +	} else if (firmware_has_feature(FW_FEATURE_FORM2_AFFINITY)) {
> +		dbg("Using form 2 affinity\n");
> +		affinity_form = FORM2_AFFINITY;
>  	} else if (firmware_has_feature(FW_FEATURE_FORM1_AFFINITY)) {
>  		dbg("Using form 1 affinity\n");
>  		affinity_form = FORM1_AFFINITY;
> @@ -367,8 +467,17 @@ static int __init find_primary_domain_index(void)
>  
>  		index = of_read_number(&distance_ref_points[1], 1);
>  	} else {
> +		/*
> +		 * Both FORM1 and FORM2 affinity find the primary domain details
> +		 * at the same offset.
> +		 */
>  		index = of_read_number(distance_ref_points, 1);
>  	}
> +	/*
> +	 * If it is FORM2 also initialize the distance table here.
> +	 */
> +	if (affinity_form == FORM2_AFFINITY)
> +		initialize_form2_numa_distance_lookup_table(root);

Ew.  Calling a function called "find_primary_domain_index" to also
initialize the main distance table is needlessly counterintuitive.
Move this call to parse_numa_properties().
>  
>  	/*
>  	 * Warn and cap if the hardware supports more than
> diff --git a/arch/powerpc/platforms/pseries/firmware.c b/arch/powerpc/platforms/pseries/firmware.c
> index 5d4c2bc20bba..f162156b7b68 100644
> --- a/arch/powerpc/platforms/pseries/firmware.c
> +++ b/arch/powerpc/platforms/pseries/firmware.c
> @@ -123,6 +123,7 @@ vec5_fw_features_table[] = {
>  	{FW_FEATURE_PRRN,		OV5_PRRN},
>  	{FW_FEATURE_DRMEM_V2,		OV5_DRMEM_V2},
>  	{FW_FEATURE_DRC_INFO,		OV5_DRC_INFO},
> +	{FW_FEATURE_FORM2_AFFINITY,	OV5_FORM2_AFFINITY},
>  };
>  
>  static void __init fw_vec5_feature_init(const char *vec5, unsigned long len)
Aneesh Kumar K V July 22, 2021, 7:34 a.m. UTC | #2
David Gibson <david@gibson.dropbear.id.au> writes:

> On Mon, Jun 28, 2021 at 08:41:17PM +0530, Aneesh Kumar K.V wrote:
>> PAPR interface currently supports two different ways of communicating resource
>> grouping details to the OS. These are referred to as Form 0 and Form 1
>> associativity grouping. Form 0 is the older format and is now considered
>> deprecated. This patch adds another resource grouping named FORM2.
>> 
>> Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>
>> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
>> ---
>>  Documentation/powerpc/associativity.rst   | 103 ++++++++++++++
>>  arch/powerpc/include/asm/firmware.h       |   3 +-
>>  arch/powerpc/include/asm/prom.h           |   1 +
>>  arch/powerpc/kernel/prom_init.c           |   3 +-
>>  arch/powerpc/mm/numa.c                    | 157 ++++++++++++++++++----
>>  arch/powerpc/platforms/pseries/firmware.c |   1 +
>>  6 files changed, 242 insertions(+), 26 deletions(-)
>>  create mode 100644 Documentation/powerpc/associativity.rst
>> 
>> diff --git a/Documentation/powerpc/associativity.rst b/Documentation/powerpc/associativity.rst
>> new file mode 100644
>> index 000000000000..31cc7da2c7a6
>> --- /dev/null
>> +++ b/Documentation/powerpc/associativity.rst
>> @@ -0,0 +1,103 @@
>> +============================
>> +NUMA resource associativity
>> +=============================
>> +
>> +Associativity represents the groupings of the various platform resources into
>> +domains of substantially similar mean performance relative to resources outside
>> +of that domain. Resources subsets of a given domain that exhibit better
>> +performance relative to each other than relative to other resources subsets
>> +are represented as being members of a sub-grouping domain. This performance
>> +characteristic is presented in terms of NUMA node distance within the Linux kernel.
>> +From the platform view, these groups are also referred to as domains.
>
> Pretty hard to decipher, but that's typical for PAPR.
>
>> +PAPR interface currently supports different ways of communicating these resource
>> +grouping details to the OS. These are referred to as Form 0, Form 1 and Form2
>> +associativity grouping. Form 0 is the older format and is now considered deprecated.
>
> Nit: s/older/oldest/ since there are now >2 forms.

Updated.

>
>> +Hypervisor indicates the type/form of associativity used via "ibm,architecture-vec-5 property".
>> +Bit 0 of byte 5 in the "ibm,architecture-vec-5" property indicates usage of Form 0 or Form 1.
>> +A value of 1 indicates the usage of Form 1 associativity. For Form 2 associativity
>> +bit 2 of byte 5 in the "ibm,architecture-vec-5" property is used.
>> +
>> +Form 0
>> +-----
>> +Form 0 associativity supports only two NUMA distances (LOCAL and REMOTE).
>> +
>> +Form 1
>> +-----
>> +With Form 1 a combination of ibm,associativity-reference-points, and ibm,associativity
>> +device tree properties are used to determine the NUMA distance between resource groups/domains.
>> +
>> +The “ibm,associativity” property contains a list of one or more numbers (domainID)
>> +representing the resource’s platform grouping domains.
>> +
>> +The “ibm,associativity-reference-points” property contains a list of one or more numbers
>> +(domainID index) that represents the 1 based ordinal in the associativity lists.
>> +The list of domainID indexes represents an increasing hierarchy of resource grouping.
>> +
>> +ex:
>> +{ primary domainID index, secondary domainID index, tertiary domainID index.. }
>> +
>> +Linux kernel uses the domainID at the primary domainID index as the NUMA node id.
>> +Linux kernel computes NUMA distance between two domains by recursively comparing
>> +if they belong to the same higher-level domains. For mismatch at every higher
>> +level of the resource group, the kernel doubles the NUMA distance between the
>> +comparing domains.
>> +
>> +Form 2
>> +-------
>> +Form 2 associativity format adds separate device tree properties representing NUMA node distance
>> +thereby making the node distance computation flexible. Form 2 also allows flexible primary
>> +domain numbering. With numa distance computation now detached from the index value in
>> +"ibm,associativity-reference-points" property, Form 2 allows a large number of primary domain
>> +ids at the same domainID index representing resource groups of different performance/latency
>> +characteristics.
>> +
>> +Hypervisor indicates the usage of FORM2 associativity using bit 2 of byte 5 in the
>> +"ibm,architecture-vec-5" property.
>> +
>> +"ibm,numa-lookup-index-table" property contains a list of one or more numbers representing
>> +the domainIDs present in the system. The offset of the domainID in this property is
>> +used as an index while computing numa distance information via "ibm,numa-distance-table".
>> +
>> +prop-encoded-array: The number N of the domainIDs encoded as with encode-int, followed by
>> +N domainID encoded as with encode-int
>> +
>> +For ex:
>> +"ibm,numa-lookup-index-table" =  {4, 0, 8, 250, 252}. The offset of domainID 8 (2) is used when
>> +computing the distance of domain 8 from other domains present in the system. For the rest of
>> +this document, this offset will be referred to as domain distance offset.
>> +
>> +"ibm,numa-distance-table" property contains a list of one or more numbers representing the NUMA
>> +distance between resource groups/domains present in the system.
>> +
>> +prop-encoded-array: The number N of the distance values encoded as with encode-int, followed by
>> +N distance values encoded as with encode-bytes. The max distance value we could encode is 255.
>> +The number N must be equal to the square of m where m is the number of domainIDs in the
>> +numa-lookup-index-table.
>> +
>> +For ex:
>> +ibm,numa-lookup-index-table =  {3, 0, 8, 40}
>> +ibm,numa-distance-table     =  {9, 10, 20, 80, 20, 10, 160, 80, 160, 10}
>
> This representation doesn't make it clear that the 9 is a u32, but the
> rest are u8s.

How do you suggest we specify that? I could write 9:u32 10:u8 and so on,
but since the encoding is already explained in the paragraph above, is
that needed?

>
>> +
>> +  | 0    8   40
>> +--|------------
>> +  |
>> +0 | 10   20  80
>> +  |
>> +8 | 20   10  160
>> +  |
>> +40| 80   160  10
>> +
>> +A possible "ibm,associativity" property for resources in node 0, 8 and 40
>> +
>> +{ 3, 6, 7, 0 }
>> +{ 3, 6, 9, 8 }
>> +{ 3, 6, 7, 40}
>> +
>> +With "ibm,associativity-reference-points"  { 0x3 }
>
> You haven't actually described how ibm,associativity-reference-points
> operates in Form2.

Nothing changes in the definition of associativity-reference-points with
FORM2. It will continue to show the increasing hierarchy of resource
groups.
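
As a concrete reading of the example in the documentation, and assuming the
primary domain index is taken from the first reference point exactly as
associativity_to_nid() in the patch uses it: with
"ibm,associativity-reference-points" = { 0x3 }, the NUMA node id is the third
domainID of each "ibm,associativity" list. The sketch below uses host-endian
arrays and a made-up helper name, assoc_to_nid(), and leaves out the
numa_enabled, 0xffff and nr_node_ids checks.

```c
#define NUMA_NO_NODE	(-1)

/*
 * assoc[0] holds the number of domainIDs that follow, so a 1-based
 * reference point indexes the array directly.
 */
int assoc_to_nid(const unsigned int *assoc, unsigned int primary_domain_index)
{
	if (assoc[0] < primary_domain_index)
		return NUMA_NO_NODE;
	return (int)assoc[primary_domain_index];
}
```

Applied to the three example lists, this yields node ids 0, 8 and 40. Going
by the documentation text, the distance between those nodes is then read from
"ibm,numa-distance-table" and is not derived from the reference-point index
values.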

>
>> +"ibm,lookup-index-table" helps in having a compact representation of distance matrix.
>> +Since domainID can be sparse, the matrix of distances can also be effectively sparse.
>> +With "ibm,lookup-index-table" we can achieve a compact representation of
>> +distance information.
>> diff --git a/arch/powerpc/include/asm/firmware.h b/arch/powerpc/include/asm/firmware.h
>> index 60b631161360..97a3bd9ffeb9 100644
>> --- a/arch/powerpc/include/asm/firmware.h
>> +++ b/arch/powerpc/include/asm/firmware.h
>> @@ -53,6 +53,7 @@
>>  #define FW_FEATURE_ULTRAVISOR	ASM_CONST(0x0000004000000000)
>>  #define FW_FEATURE_STUFF_TCE	ASM_CONST(0x0000008000000000)
>>  #define FW_FEATURE_RPT_INVALIDATE ASM_CONST(0x0000010000000000)
>> +#define FW_FEATURE_FORM2_AFFINITY ASM_CONST(0x0000020000000000)
>>  
>>  #ifndef __ASSEMBLY__
>>  
>> @@ -73,7 +74,7 @@ enum {
>>  		FW_FEATURE_HPT_RESIZE | FW_FEATURE_DRMEM_V2 |
>>  		FW_FEATURE_DRC_INFO | FW_FEATURE_BLOCK_REMOVE |
>>  		FW_FEATURE_PAPR_SCM | FW_FEATURE_ULTRAVISOR |
>> -		FW_FEATURE_RPT_INVALIDATE,
>> +		FW_FEATURE_RPT_INVALIDATE | FW_FEATURE_FORM2_AFFINITY,
>>  	FW_FEATURE_PSERIES_ALWAYS = 0,
>>  	FW_FEATURE_POWERNV_POSSIBLE = FW_FEATURE_OPAL | FW_FEATURE_ULTRAVISOR,
>>  	FW_FEATURE_POWERNV_ALWAYS = 0,
>> diff --git a/arch/powerpc/include/asm/prom.h b/arch/powerpc/include/asm/prom.h
>> index df9fec9d232c..5c80152e8f18 100644
>> --- a/arch/powerpc/include/asm/prom.h
>> +++ b/arch/powerpc/include/asm/prom.h
>> @@ -149,6 +149,7 @@ extern int of_read_drc_info_cell(struct property **prop,
>>  #define OV5_XCMO		0x0440	/* Page Coalescing */
>>  #define OV5_FORM1_AFFINITY	0x0580	/* FORM1 NUMA affinity */
>>  #define OV5_PRRN		0x0540	/* Platform Resource Reassignment */
>> +#define OV5_FORM2_AFFINITY	0x0520	/* Form2 NUMA affinity */
>>  #define OV5_HP_EVT		0x0604	/* Hot Plug Event support */
>>  #define OV5_RESIZE_HPT		0x0601	/* Hash Page Table resizing */
>>  #define OV5_PFO_HW_RNG		0x1180	/* PFO Random Number Generator */
>> diff --git a/arch/powerpc/kernel/prom_init.c b/arch/powerpc/kernel/prom_init.c
>> index 5d9ea059594f..c483df6c9393 100644
>> --- a/arch/powerpc/kernel/prom_init.c
>> +++ b/arch/powerpc/kernel/prom_init.c
>> @@ -1069,7 +1069,8 @@ static const struct ibm_arch_vec ibm_architecture_vec_template __initconst = {
>>  #else
>>  		0,
>>  #endif
>> -		.associativity = OV5_FEAT(OV5_FORM1_AFFINITY) | OV5_FEAT(OV5_PRRN),
>> +		.associativity = OV5_FEAT(OV5_FORM1_AFFINITY) | OV5_FEAT(OV5_PRRN) |
>> +		OV5_FEAT(OV5_FORM2_AFFINITY),
>>  		.bin_opts = OV5_FEAT(OV5_RESIZE_HPT) | OV5_FEAT(OV5_HP_EVT),
>>  		.micro_checkpoint = 0,
>>  		.reserved0 = 0,
>> diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
>> index c6293037a103..c68846fc9550 100644
>> --- a/arch/powerpc/mm/numa.c
>> +++ b/arch/powerpc/mm/numa.c
>> @@ -56,12 +56,17 @@ static int n_mem_addr_cells, n_mem_size_cells;
>>  
>>  #define FORM0_AFFINITY 0
>>  #define FORM1_AFFINITY 1
>> +#define FORM2_AFFINITY 2
>>  static int affinity_form;
>>  
>>  #define MAX_DISTANCE_REF_POINTS 4
>>  static int max_associativity_domain_index;
>>  static const __be32 *distance_ref_points;
>>  static int distance_lookup_table[MAX_NUMNODES][MAX_DISTANCE_REF_POINTS];
>> +static int numa_distance_table[MAX_NUMNODES][MAX_NUMNODES] = {
>> +	[0 ... MAX_NUMNODES - 1] = { [0 ... MAX_NUMNODES - 1] = -1 }
>> +};
>> +static int numa_id_index_table[MAX_NUMNODES] = { [0 ... MAX_NUMNODES - 1] = NUMA_NO_NODE };
>>  
>>  /*
>>   * Allocate node_to_cpumask_map based on number of available nodes
>> @@ -166,6 +171,44 @@ static void unmap_cpu_from_node(unsigned long cpu)
>>  }
>>  #endif /* CONFIG_HOTPLUG_CPU || CONFIG_PPC_SPLPAR */
>>  
>> +/*
>> + * Returns nid in the range [0..nr_node_ids], or -1 if no useful NUMA
>> + * info is found.
>> + */
>> +static int associativity_to_nid(const __be32 *associativity)
>> +{
>> +	int nid = NUMA_NO_NODE;
>> +
>> +	if (!numa_enabled)
>> +		goto out;
>> +
>> +	if (of_read_number(associativity, 1) >= primary_domain_index)
>> +		nid = of_read_number(&associativity[primary_domain_index], 1);
>> +
>> +	/* POWER4 LPAR uses 0xffff as invalid node */
>> +	if (nid == 0xffff || nid >= nr_node_ids)
>> +		nid = NUMA_NO_NODE;
>> +out:
>> +	return nid;
>> +}
>> +
>> +static int __cpu_form2_relative_distance(__be32 *cpu1_assoc, __be32 *cpu2_assoc)
>> +{
>> +	int dist;
>> +	int node1, node2;
>> +
>> +	node1 = associativity_to_nid(cpu1_assoc);
>> +	node2 = associativity_to_nid(cpu2_assoc);
>> +
>> +	dist = numa_distance_table[node1][node2];
>> +	if (dist <= LOCAL_DISTANCE)
>> +		return 0;
>> +	else if (dist <= REMOTE_DISTANCE)
>> +		return 1;
>> +	else
>> +		return 2;
>
> Squashing the full range of distances into just 0, 1 or 2 seems odd.
> But then, this whole cpu_distance() thing being distinct from
> node_distance() seems odd.
>
>> +}
>> +
>>  static int __cpu_form1_relative_distance(__be32 *cpu1_assoc, __be32 *cpu2_assoc)
>>  {
>>  	int dist = 0;
>> @@ -186,8 +229,9 @@ int cpu_relative_distance(__be32 *cpu1_assoc, __be32 *cpu2_assoc)
>>  {
>>  	/* We should not get called with FORM0 */
>>  	VM_WARN_ON(affinity_form == FORM0_AFFINITY);
>> -
>> -	return __cpu_form1_relative_distance(cpu1_assoc, cpu2_assoc);
>> +	if (affinity_form == FORM1_AFFINITY)
>> +		return __cpu_form1_relative_distance(cpu1_assoc, cpu2_assoc);
>> +	return __cpu_form2_relative_distance(cpu1_assoc, cpu2_assoc);
>>  }
>>  
>>  /* must hold reference to node during call */
>> @@ -201,7 +245,9 @@ int __node_distance(int a, int b)
>>  	int i;
>>  	int distance = LOCAL_DISTANCE;
>>  
>> -	if (affinity_form == FORM0_AFFINITY)
>> +	if (affinity_form == FORM2_AFFINITY)
>> +		return numa_distance_table[a][b];
>> +	else if (affinity_form == FORM0_AFFINITY)
>>  		return ((a == b) ? LOCAL_DISTANCE : REMOTE_DISTANCE);
>>  
>>  	for (i = 0; i < max_associativity_domain_index; i++) {
>
> Hmm.. couldn't we simplify this whole __node_distance function, if we
> just update numa_distance_table[][] appropriately for Form0 and Form1
> as well?

IIUC, you are suggesting that we look at using numa_distance_table[a][b]
for FORM1_AFFINITY as well? Can I do that as part of a separate patch?
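
For reference, a rough userspace sketch of what that could look like for
Form 1: fill a table once using the same doubling rule that
__node_distance() applies today, so that the lookup becomes a plain array
read. NR_NODES, NR_LEVELS, lookup[][] and fill_form1_distance_table() are
made up for this illustration and the domain IDs are invented; the real
tables are sized by MAX_NUMNODES and MAX_DISTANCE_REF_POINTS.

```c
#include <assert.h>

#define LOCAL_DISTANCE 10
#define NR_NODES 3 /* hypothetical node count for the example */
#define NR_LEVELS 2 /* hypothetical number of reference points */

/* Per-node domain IDs at each reference point, primary first (invented). */
static const int lookup[NR_NODES][NR_LEVELS] = {
	{ 0, 100 },
	{ 1, 100 },
	{ 2, 200 },
};

static int dist_table[NR_NODES][NR_NODES];

/* Form 1 rule: double the distance for every level that differs. */
static int form1_distance(int a, int b)
{
	int i, distance = LOCAL_DISTANCE;

	assert(a >= 0 && a < NR_NODES && b >= 0 && b < NR_NODES);
	for (i = 0; i < NR_LEVELS; i++) {
		if (lookup[a][i] == lookup[b][i])
			break;
		distance *= 2;
	}
	return distance;
}

/* Precompute every pair once so later lookups are dist_table[a][b]. */
static void fill_form1_distance_table(void)
{
	int a, b;

	for (a = 0; a < NR_NODES; a++)
		for (b = 0; b < NR_NODES; b++)
			dist_table[a][b] = form1_distance(a, b);
}
```

With the table filled this way, Form 0 would only need LOCAL/REMOTE
entries and all three forms could share the same lookup.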

>
>> @@ -216,27 +262,6 @@ int __node_distance(int a, int b)
>>  }
>>  EXPORT_SYMBOL(__node_distance);
>>  
>> -/*
>> - * Returns nid in the range [0..nr_node_ids], or -1 if no useful NUMA
>> - * info is found.
>> - */
>> -static int associativity_to_nid(const __be32 *associativity)
>> -{
>> -	int nid = NUMA_NO_NODE;
>> -
>> -	if (!numa_enabled)
>> -		goto out;
>> -
>> -	if (of_read_number(associativity, 1) >= primary_domain_index)
>> -		nid = of_read_number(&associativity[primary_domain_index], 1);
>> -
>> -	/* POWER4 LPAR uses 0xffff as invalid node */
>> -	if (nid == 0xffff || nid >= nr_node_ids)
>> -		nid = NUMA_NO_NODE;
>> -out:
>> -	return nid;
>> -}
>> -
>>  /* Returns the nid associated with the given device tree node,
>>   * or -1 if not found.
>>   */
>> @@ -305,12 +330,84 @@ static void initialize_form1_numa_distance(struct device_node *node)
>>   */
>>  void update_numa_distance(struct device_node *node)
>>  {
>> +	int nid;
>> +
>>  	if (affinity_form == FORM0_AFFINITY)
>>  		return;
>>  	else if (affinity_form == FORM1_AFFINITY) {
>>  		initialize_form1_numa_distance(node);
>>  		return;
>>  	}
>> +
>> +	/* FORM2 affinity  */
>> +	nid = of_node_to_nid_single(node);
>> +	if (nid == NUMA_NO_NODE)
>> +		return;
>> +
>> +	/*
>> +	 * With FORM2 we expect NUMA distance of all possible NUMA
>> +	 * nodes to be provided during boot.
>> +	 */
>> +	WARN(numa_distance_table[nid][nid] == -1,
>> +	     "NUMA distance details for node %d not provided\n", nid);
>> +}
>> +
>> +/*
>> + * ibm,numa-lookup-index-table= {N, domainid1, domainid2, ..... domainidN}
>> + * ibm,numa-distance-table = { N, 1, 2, 4, 5, 1, 6, .... N elements}
>> + */
>> +static void initialize_form2_numa_distance_lookup_table(struct device_node *root)
>> +{
>> +	int i, j;
>> +	const __u8 *numa_dist_table;
>> +	const __be32 *numa_lookup_index;
>> +	int numa_dist_table_length;
>> +	int max_numa_index, distance_index;
>> +
>> +	numa_lookup_index = of_get_property(root, "ibm,numa-lookup-index-table", NULL);
>> +	max_numa_index = of_read_number(&numa_lookup_index[0], 1);
>> +
>> +	/* first element of the array is the size and is encode-int */
>> +	numa_dist_table = of_get_property(root, "ibm,numa-distance-table", NULL);
>> +	numa_dist_table_length = of_read_number((const __be32 *)&numa_dist_table[0], 1);
>> +	/* Skip the size which is encoded int */
>> +	numa_dist_table += sizeof(__be32);
>> +
>> +	pr_debug("numa_dist_table_len = %d, numa_dist_indexes_len = %d\n",
>> +		 numa_dist_table_length, max_numa_index);
>> +
>> +	for (i = 0; i < max_numa_index; i++)
>> +		/* +1 skip the max_numa_index in the property */
>> +		numa_id_index_table[i] = of_read_number(&numa_lookup_index[i + 1], 1);
>> +
>> +
>> +	if (numa_dist_table_length != max_numa_index * max_numa_index) {
>> +
>> +		WARN(1, "Wrong NUMA distance information\n");
>> +		/* consider everybody else just remote. */
>> +		for (i = 0;  i < max_numa_index; i++) {
>> +			for (j = 0; j < max_numa_index; j++) {
>> +				int nodeA = numa_id_index_table[i];
>> +				int nodeB = numa_id_index_table[j];
>> +
>> +				if (nodeA == nodeB)
>> +					numa_distance_table[nodeA][nodeB] = LOCAL_DISTANCE;
>> +				else
>> +					numa_distance_table[nodeA][nodeB] = REMOTE_DISTANCE;
>> +			}
>> +		}
>> +	}
>> +
>> +	distance_index = 0;
>> +	for (i = 0;  i < max_numa_index; i++) {
>> +		for (j = 0; j < max_numa_index; j++) {
>> +			int nodeA = numa_id_index_table[i];
>> +			int nodeB = numa_id_index_table[j];
>> +
>> +			numa_distance_table[nodeA][nodeB] = numa_dist_table[distance_index++];
>> +			pr_debug("dist[%d][%d]=%d ", nodeA, nodeB, numa_distance_table[nodeA][nodeB]);
>> +		}
>> +	}
>>  }
>>  
>>  static int __init find_primary_domain_index(void)
>> @@ -323,6 +420,9 @@ static int __init find_primary_domain_index(void)
>>  	 */
>>  	if (firmware_has_feature(FW_FEATURE_OPAL)) {
>>  		affinity_form = FORM1_AFFINITY;
>> +	} else if (firmware_has_feature(FW_FEATURE_FORM2_AFFINITY)) {
>> +		dbg("Using form 2 affinity\n");
>> +		affinity_form = FORM2_AFFINITY;
>>  	} else if (firmware_has_feature(FW_FEATURE_FORM1_AFFINITY)) {
>>  		dbg("Using form 1 affinity\n");
>>  		affinity_form = FORM1_AFFINITY;
>> @@ -367,8 +467,17 @@ static int __init find_primary_domain_index(void)
>>  
>>  		index = of_read_number(&distance_ref_points[1], 1);
>>  	} else {
>> +		/*
>> +		 * Both FORM1 and FORM2 affinity find the primary domain details
>> +		 * at the same offset.
>> +		 */
>>  		index = of_read_number(distance_ref_points, 1);
>>  	}
>> +	/*
>> +	 * If it is FORM2 also initialize the distance table here.
>> +	 */
>> +	if (affinity_form == FORM2_AFFINITY)
>> +		initialize_form2_numa_distance_lookup_table(root);
>
> Ew.  Calling a function called "find_primary_domain_index" to also
> initialize the main distance table is needlessly counterintuitive.
> Move this call to parse_numa_properties().

I ended up doing it here because 'root' is already fetched here. But I
agree it is confusing. I will move the fetching of root into
initialize_form2_numa_distance_lookup_table() and move the call out of
the primary_index lookup.

modified   arch/powerpc/mm/numa.c
@@ -355,14 +355,22 @@ void update_numa_distance(struct device_node *node)
  * ibm,numa-lookup-index-table= {N, domainid1, domainid2, ..... domainidN}
  * ibm,numa-distance-table = { N, 1, 2, 4, 5, 1, 6, .... N elements}
  */
-static void initialize_form2_numa_distance_lookup_table(struct device_node *root)
+static void initialize_form2_numa_distance_lookup_table(void)
 {
 	int i, j;
+	struct device_node *root;
 	const __u8 *numa_dist_table;
 	const __be32 *numa_lookup_index;
 	int numa_dist_table_length;
 	int max_numa_index, distance_index;
 
+	if (firmware_has_feature(FW_FEATURE_OPAL))
+		root = of_find_node_by_path("/ibm,opal");
+	else
+		root = of_find_node_by_path("/rtas");
+	if (!root)
+		root = of_find_node_by_path("/");
+
 	numa_lookup_index = of_get_property(root, "ibm,numa-lookup-index-table", NULL);
 	max_numa_index = of_read_number(&numa_lookup_index[0], 1);
 
@@ -407,6 +415,7 @@ static void initialize_form2_numa_distance_lookup_table(struct device_node *root
 			pr_debug("dist[%d][%d]=%d ", nodeA, nodeB, numa_distance_table[nodeA][nodeB]);
 		}
 	}
+	of_node_put(root);
 }
 
 static int __init find_primary_domain_index(void)
@@ -472,12 +481,6 @@ static int __init find_primary_domain_index(void)
 		 */
 		index = of_read_number(distance_ref_points, 1);
 	}
-	/*
-	 * If it is FORM2 also initialize the distance table here.
-	 */
-	if (affinity_form == FORM2_AFFINITY)
-		initialize_form2_numa_distance_lookup_table(root);
-
 	/*
 	 * Warn and cap if the hardware supports more than
 	 * MAX_DISTANCE_REF_POINTS domains.
@@ -916,6 +919,12 @@ static int __init parse_numa_properties(void)
 
 	dbg("NUMA associativity depth for CPU/Memory: %d\n", primary_domain_index);
 
+	/*
+	 * If it is FORM2 also initialize the distance table here.
+	 */
+	if (affinity_form == FORM2_AFFINITY)
+		initialize_form2_numa_distance_lookup_table();
+
 	/*
 	 * Even though we connect cpus to numa domains later in SMP
 	 * init, we need to know the node ids now. This is because

-aneesh
David Gibson July 26, 2021, 2:41 a.m. UTC | #3
On Thu, Jul 22, 2021 at 01:04:42PM +0530, Aneesh Kumar K.V wrote:
> David Gibson <david@gibson.dropbear.id.au> writes:
> 
> > On Mon, Jun 28, 2021 at 08:41:17PM +0530, Aneesh Kumar K.V wrote:
> >> PAPR interface currently supports two different ways of communicating resource
> >> grouping details to the OS. These are referred to as Form 0 and Form 1
> >> associativity grouping. Form 0 is the older format and is now considered
> >> deprecated. This patch adds another resource grouping named FORM2.
> >> 
> >> Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>
> >> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
> >> ---
> >>  Documentation/powerpc/associativity.rst   | 103 ++++++++++++++
> >>  arch/powerpc/include/asm/firmware.h       |   3 +-
> >>  arch/powerpc/include/asm/prom.h           |   1 +
> >>  arch/powerpc/kernel/prom_init.c           |   3 +-
> >>  arch/powerpc/mm/numa.c                    | 157 ++++++++++++++++++----
> >>  arch/powerpc/platforms/pseries/firmware.c |   1 +
> >>  6 files changed, 242 insertions(+), 26 deletions(-)
> >>  create mode 100644 Documentation/powerpc/associativity.rst
> >> 
> >> diff --git a/Documentation/powerpc/associativity.rst b/Documentation/powerpc/associativity.rst
> >> new file mode 100644
> >> index 000000000000..31cc7da2c7a6
> >> --- /dev/null
> >> +++ b/Documentation/powerpc/associativity.rst
> >> @@ -0,0 +1,103 @@
> >> +============================
> >> +NUMA resource associativity
> >> +=============================
> >> +
> >> +Associativity represents the groupings of the various platform resources into
> >> +domains of substantially similar mean performance relative to resources outside
> >> +of that domain. Resources subsets of a given domain that exhibit better
> >> +performance relative to each other than relative to other resources subsets
> >> +are represented as being members of a sub-grouping domain. This performance
> >> +characteristic is presented in terms of NUMA node distance within the Linux kernel.
> >> +From the platform view, these groups are also referred to as domains.
> >
> > Pretty hard to decipher, but that's typical for PAPR.
> >
> >> +PAPR interface currently supports different ways of communicating these resource
> >> +grouping details to the OS. These are referred to as Form 0, Form 1 and Form2
> >> +associativity grouping. Form 0 is the older format and is now considered deprecated.
> >
> > Nit: s/older/oldest/ since there are now >2 forms.
> 
> updated.
> 
> >
> >> +Hypervisor indicates the type/form of associativity used via "ibm,architecture-vec-5 property".
> >> +Bit 0 of byte 5 in the "ibm,architecture-vec-5" property indicates usage of Form 0 or Form 1.
> >> +A value of 1 indicates the usage of Form 1 associativity. For Form 2 associativity
> >> +bit 2 of byte 5 in the "ibm,architecture-vec-5" property is used.
> >> +
> >> +Form 0
> >> +-----
> >> +Form 0 associativity supports only two NUMA distances (LOCAL and REMOTE).
> >> +
> >> +Form 1
> >> +-----
> >> +With Form 1 a combination of ibm,associativity-reference-points, and ibm,associativity
> >> +device tree properties are used to determine the NUMA distance between resource groups/domains.
> >> +
> >> +The “ibm,associativity” property contains a list of one or more numbers (domainID)
> >> +representing the resource’s platform grouping domains.
> >> +
> >> +The “ibm,associativity-reference-points” property contains a list of one or more numbers
> >> +(domainID index) that represents the 1 based ordinal in the associativity lists.
> >> +The list of domainID indexes represents an increasing hierarchy of resource grouping.
> >> +
> >> +ex:
> >> +{ primary domainID index, secondary domainID index, tertiary domainID index.. }
> >> +
> >> +Linux kernel uses the domainID at the primary domainID index as the NUMA node id.
> >> +Linux kernel computes NUMA distance between two domains by recursively comparing
> >> +if they belong to the same higher-level domains. For mismatch at every higher
> >> +level of the resource group, the kernel doubles the NUMA distance between the
> >> +comparing domains.
> >> +
> >> +Form 2
> >> +-------
> >> +Form 2 associativity format adds separate device tree properties representing NUMA node distance
> >> +thereby making the node distance computation flexible. Form 2 also allows flexible primary
> >> +domain numbering. With numa distance computation now detached from the index value in
> >> +"ibm,associativity-reference-points" property, Form 2 allows a large number of primary domain
> >> +ids at the same domainID index representing resource groups of different performance/latency
> >> +characteristics.
> >> +
> >> +Hypervisor indicates the usage of FORM2 associativity using bit 2 of byte 5 in the
> >> +"ibm,architecture-vec-5" property.
> >> +
> >> +"ibm,numa-lookup-index-table" property contains a list of one or more numbers representing
> >> +the domainIDs present in the system. The offset of the domainID in this property is
> >> +used as an index while computing numa distance information via "ibm,numa-distance-table".
> >> +
> >> +prop-encoded-array: The number N of the domainIDs encoded as with encode-int, followed by
> >> +N domainID encoded as with encode-int
> >> +
> >> +For ex:
> >> +"ibm,numa-lookup-index-table" =  {4, 0, 8, 250, 252}. The offset of domainID 8 (2) is used when
> >> +computing the distance of domain 8 from other domains present in the system. For the rest of
> >> +this document, this offset will be referred to as domain distance offset.
> >> +
> >> +"ibm,numa-distance-table" property contains a list of one or more numbers representing the NUMA
> >> +distance between resource groups/domains present in the system.
> >> +
> >> +prop-encoded-array: The number N of the distance values encoded as with encode-int, followed by
> >> +N distance values encoded as with encode-bytes. The max distance value we could encode is 255.
> >> +The number N must be equal to the square of m where m is the number of domainIDs in the
> >> +numa-lookup-index-table.
> >> +
> >> +For ex:
> >> +ibm,numa-lookup-index-table =  {3, 0, 8, 40}
> >> +ibm,numa-distance-table     =  {9, 10, 20, 80, 20, 10, 160, 80, 160, 10}
> >
> > This representation doesn't make it clear that the 9 is a u32, but the
> > rest are u8s.
> 
> How do you suggest we specify that? I could do 9:u32 10:u8 etc. But
> considering the details are explained in the paragraph above, is that
> needed?

Yes, I think it is needed.  The examples are, honestly, a lot easier
to read and follow than the PAPR-ese text, so people are much more
likely to be looking at those than parsing the minutiae of the text.
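
As a concrete illustration of the mixed encoding, here is a small
userspace sketch (not kernel code) that decodes the example quoted above
from raw property bytes: the lookup-index table is all big-endian u32
cells, while the distance table is one big-endian u32 count followed by
single-byte distances. The helper names be32_at(), parse_form2() and
form2_distance(), the MAX_IDS bound and the LOCAL/REMOTE fallback on a
size mismatch are assumptions made for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LOCAL_DISTANCE 10
#define REMOTE_DISTANCE 20
#define MAX_IDS 8 /* hypothetical bound for the example */

static uint32_t ids[MAX_IDS]; /* domainIDs, in lookup-index order */
static uint8_t dist[MAX_IDS][MAX_IDS]; /* distances by lookup index */
static uint32_t nr_ids;

/* Read one big-endian 32-bit cell (encode-int) at byte offset off. */
static uint32_t be32_at(const uint8_t *p, size_t off)
{
	return ((uint32_t)p[off] << 24) | ((uint32_t)p[off + 1] << 16) |
	       ((uint32_t)p[off + 2] << 8) | (uint32_t)p[off + 3];
}

/*
 * index_prop: u32 count m, then m u32 domainIDs.
 * dist_prop:  u32 count N, then N u8 distances, row-major.
 * Returns 0 on success, -1 if N != m * m (table then holds only
 * LOCAL/REMOTE values).
 */
static int parse_form2(const uint8_t *index_prop, const uint8_t *dist_prop)
{
	uint32_t i, j, n;

	nr_ids = be32_at(index_prop, 0);
	assert(nr_ids <= MAX_IDS);
	for (i = 0; i < nr_ids; i++)
		ids[i] = be32_at(index_prop, 4 * (i + 1));

	n = be32_at(dist_prop, 0);
	for (i = 0; i < nr_ids; i++) {
		for (j = 0; j < nr_ids; j++) {
			if (n == nr_ids * nr_ids)
				dist[i][j] = dist_prop[4 + i * nr_ids + j];
			else
				dist[i][j] = (i == j) ? LOCAL_DISTANCE : REMOTE_DISTANCE;
		}
	}
	return n == nr_ids * nr_ids ? 0 : -1;
}

/* Distance between two domainIDs, or -1 if either is not in the index table. */
static int form2_distance(uint32_t a, uint32_t b)
{
	int ia = -1, ib = -1;
	uint32_t i;

	for (i = 0; i < nr_ids; i++) {
		if (ids[i] == a)
			ia = (int)i;
		if (ids[i] == b)
			ib = (int)i;
	}
	return (ia < 0 || ib < 0) ? -1 : dist[ia][ib];
}

/* The example from the document: {3, 0, 8, 40} and {9, 10, 20, 80, ...}. */
static const uint8_t example_index[] = {
	0, 0, 0, 3, /* count: u32 */
	0, 0, 0, 0, /* domainID 0: u32 */
	0, 0, 0, 8, /* domainID 8: u32 */
	0, 0, 0, 40, /* domainID 40: u32 */
};
static const uint8_t example_dist[] = {
	0, 0, 0, 9, /* count: one u32 */
	10, 20, 80, /* nine u8 distances, one row per line */
	20, 10, 160,
	80, 160, 10,
};
```

Seen this way, the distance property is 13 bytes long, not 40, which is
the detail the flat { 9, 10, 20, ... } notation hides.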

> >> +
> >> +  | 0    8   40
> >> +--|------------
> >> +  |
> >> +0 | 10   20  80
> >> +  |
> >> +8 | 20   10  160
> >> +  |
> >> +40| 80   160  10
> >> +
> >> +A possible "ibm,associativity" property for resources in node 0, 8 and 40
> >> +
> >> +{ 3, 6, 7, 0 }
> >> +{ 3, 6, 9, 8 }
> >> +{ 3, 6, 7, 40}
> >> +
> >> +With "ibm,associativity-reference-points"  { 0x3 }
> >
> > You haven't actually described how ibm,associativity-reference-points
> > operates in Form2.
> 
> Nothing change w.r.t the definition of associativity-reference-points
> w.r.t FORM2. It still will continue to show the increasing hierarchy of
> resource groups.

I guess, except that really none of them matters any more apart from
the primary.
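
To spell that out with the example from the document: with reference
points { 0x3 }, only the cell at index 3 of each ibm,associativity array
is consulted to pick the node ID, and the distance then comes from the
table rather than from the deeper levels. A minimal userspace sketch,
mirroring associativity_to_nid() but using host-endian arrays, a made-up
assoc_to_nid() helper, a NO_NODE stand-in for NUMA_NO_NODE, and no
nr_node_ids check:

```c
#include <assert.h>
#include <stdint.h>

#define NO_NODE (-1) /* stand-in for NUMA_NO_NODE */

/*
 * assoc[0] holds the number of domainIDs that follow; primary_index is
 * the 1-based position taken from the first reference point.
 */
static int assoc_to_nid(const uint32_t *assoc, uint32_t primary_index)
{
	assert(primary_index >= 1);
	if (assoc[0] < primary_index)
		return NO_NODE;
	return (int)assoc[primary_index];
}

/* The three ibm,associativity arrays from the document's example. */
static const uint32_t assoc_a[] = { 3, 6, 7, 0 };
static const uint32_t assoc_b[] = { 3, 6, 9, 8 };
static const uint32_t assoc_c[] = { 3, 6, 7, 40 };
```

With primary index 3 these map to nodes 0, 8 and 40; the 6/7/9 cells in
between play no part in the Form 2 distance.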

> 
> >
> >> +"ibm,lookup-index-table" helps in having a compact representation of distance matrix.
> >> +Since domainID can be sparse, the matrix of distances can also be effectively sparse.
> >> +With "ibm,lookup-index-table" we can achieve a compact representation of
> >> +distance information.
> >> diff --git a/arch/powerpc/include/asm/firmware.h b/arch/powerpc/include/asm/firmware.h
> >> index 60b631161360..97a3bd9ffeb9 100644
> >> --- a/arch/powerpc/include/asm/firmware.h
> >> +++ b/arch/powerpc/include/asm/firmware.h
> >> @@ -53,6 +53,7 @@
> >>  #define FW_FEATURE_ULTRAVISOR	ASM_CONST(0x0000004000000000)
> >>  #define FW_FEATURE_STUFF_TCE	ASM_CONST(0x0000008000000000)
> >>  #define FW_FEATURE_RPT_INVALIDATE ASM_CONST(0x0000010000000000)
> >> +#define FW_FEATURE_FORM2_AFFINITY ASM_CONST(0x0000020000000000)
> >>  
> >>  #ifndef __ASSEMBLY__
> >>  
> >> @@ -73,7 +74,7 @@ enum {
> >>  		FW_FEATURE_HPT_RESIZE | FW_FEATURE_DRMEM_V2 |
> >>  		FW_FEATURE_DRC_INFO | FW_FEATURE_BLOCK_REMOVE |
> >>  		FW_FEATURE_PAPR_SCM | FW_FEATURE_ULTRAVISOR |
> >> -		FW_FEATURE_RPT_INVALIDATE,
> >> +		FW_FEATURE_RPT_INVALIDATE | FW_FEATURE_FORM2_AFFINITY,
> >>  	FW_FEATURE_PSERIES_ALWAYS = 0,
> >>  	FW_FEATURE_POWERNV_POSSIBLE = FW_FEATURE_OPAL | FW_FEATURE_ULTRAVISOR,
> >>  	FW_FEATURE_POWERNV_ALWAYS = 0,
> >> diff --git a/arch/powerpc/include/asm/prom.h b/arch/powerpc/include/asm/prom.h
> >> index df9fec9d232c..5c80152e8f18 100644
> >> --- a/arch/powerpc/include/asm/prom.h
> >> +++ b/arch/powerpc/include/asm/prom.h
> >> @@ -149,6 +149,7 @@ extern int of_read_drc_info_cell(struct property **prop,
> >>  #define OV5_XCMO		0x0440	/* Page Coalescing */
> >>  #define OV5_FORM1_AFFINITY	0x0580	/* FORM1 NUMA affinity */
> >>  #define OV5_PRRN		0x0540	/* Platform Resource Reassignment */
> >> +#define OV5_FORM2_AFFINITY	0x0520	/* Form2 NUMA affinity */
> >>  #define OV5_HP_EVT		0x0604	/* Hot Plug Event support */
> >>  #define OV5_RESIZE_HPT		0x0601	/* Hash Page Table resizing */
> >>  #define OV5_PFO_HW_RNG		0x1180	/* PFO Random Number Generator */
> >> diff --git a/arch/powerpc/kernel/prom_init.c b/arch/powerpc/kernel/prom_init.c
> >> index 5d9ea059594f..c483df6c9393 100644
> >> --- a/arch/powerpc/kernel/prom_init.c
> >> +++ b/arch/powerpc/kernel/prom_init.c
> >> @@ -1069,7 +1069,8 @@ static const struct ibm_arch_vec ibm_architecture_vec_template __initconst = {
> >>  #else
> >>  		0,
> >>  #endif
> >> -		.associativity = OV5_FEAT(OV5_FORM1_AFFINITY) | OV5_FEAT(OV5_PRRN),
> >> +		.associativity = OV5_FEAT(OV5_FORM1_AFFINITY) | OV5_FEAT(OV5_PRRN) |
> >> +		OV5_FEAT(OV5_FORM2_AFFINITY),
> >>  		.bin_opts = OV5_FEAT(OV5_RESIZE_HPT) | OV5_FEAT(OV5_HP_EVT),
> >>  		.micro_checkpoint = 0,
> >>  		.reserved0 = 0,
> >> diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
> >> index c6293037a103..c68846fc9550 100644
> >> --- a/arch/powerpc/mm/numa.c
> >> +++ b/arch/powerpc/mm/numa.c
> >> @@ -56,12 +56,17 @@ static int n_mem_addr_cells, n_mem_size_cells;
> >>  
> >>  #define FORM0_AFFINITY 0
> >>  #define FORM1_AFFINITY 1
> >> +#define FORM2_AFFINITY 2
> >>  static int affinity_form;
> >>  
> >>  #define MAX_DISTANCE_REF_POINTS 4
> >>  static int max_associativity_domain_index;
> >>  static const __be32 *distance_ref_points;
> >>  static int distance_lookup_table[MAX_NUMNODES][MAX_DISTANCE_REF_POINTS];
> >> +static int numa_distance_table[MAX_NUMNODES][MAX_NUMNODES] = {
> >> +	[0 ... MAX_NUMNODES - 1] = { [0 ... MAX_NUMNODES - 1] = -1 }
> >> +};
> >> +static int numa_id_index_table[MAX_NUMNODES] = { [0 ... MAX_NUMNODES - 1] = NUMA_NO_NODE };
> >>  
> >>  /*
> >>   * Allocate node_to_cpumask_map based on number of available nodes
> >> @@ -166,6 +171,44 @@ static void unmap_cpu_from_node(unsigned long cpu)
> >>  }
> >>  #endif /* CONFIG_HOTPLUG_CPU || CONFIG_PPC_SPLPAR */
> >>  
> >> +/*
> >> + * Returns nid in the range [0..nr_node_ids], or -1 if no useful NUMA
> >> + * info is found.
> >> + */
> >> +static int associativity_to_nid(const __be32 *associativity)
> >> +{
> >> +	int nid = NUMA_NO_NODE;
> >> +
> >> +	if (!numa_enabled)
> >> +		goto out;
> >> +
> >> +	if (of_read_number(associativity, 1) >= primary_domain_index)
> >> +		nid = of_read_number(&associativity[primary_domain_index], 1);
> >> +
> >> +	/* POWER4 LPAR uses 0xffff as invalid node */
> >> +	if (nid == 0xffff || nid >= nr_node_ids)
> >> +		nid = NUMA_NO_NODE;
> >> +out:
> >> +	return nid;
> >> +}
> >> +
> >> +static int __cpu_form2_relative_distance(__be32 *cpu1_assoc, __be32 *cpu2_assoc)
> >> +{
> >> +	int dist;
> >> +	int node1, node2;
> >> +
> >> +	node1 = associativity_to_nid(cpu1_assoc);
> >> +	node2 = associativity_to_nid(cpu2_assoc);
> >> +
> >> +	dist = numa_distance_table[node1][node2];
> >> +	if (dist <= LOCAL_DISTANCE)
> >> +		return 0;
> >> +	else if (dist <= REMOTE_DISTANCE)
> >> +		return 1;
> >> +	else
> >> +		return 2;
> >
> > Squashing the full range of distances into just 0, 1 or 2 seems odd.
> > But then, this whole cpu_distance() thing being distinct from
> > node_distance() seems odd.
> >
> >> +}
> >> +
> >>  static int __cpu_form1_relative_distance(__be32 *cpu1_assoc, __be32 *cpu2_assoc)
> >>  {
> >>  	int dist = 0;
> >> @@ -186,8 +229,9 @@ int cpu_relative_distance(__be32 *cpu1_assoc, __be32 *cpu2_assoc)
> >>  {
> >>  	/* We should not get called with FORM0 */
> >>  	VM_WARN_ON(affinity_form == FORM0_AFFINITY);
> >> -
> >> -	return __cpu_form1_relative_distance(cpu1_assoc, cpu2_assoc);
> >> +	if (affinity_form == FORM1_AFFINITY)
> >> +		return __cpu_form1_relative_distance(cpu1_assoc, cpu2_assoc);
> >> +	return __cpu_form2_relative_distance(cpu1_assoc, cpu2_assoc);
> >>  }
> >>  
> >>  /* must hold reference to node during call */
> >> @@ -201,7 +245,9 @@ int __node_distance(int a, int b)
> >>  	int i;
> >>  	int distance = LOCAL_DISTANCE;
> >>  
> >> -	if (affinity_form == FORM0_AFFINITY)
> >> +	if (affinity_form == FORM2_AFFINITY)
> >> +		return numa_distance_table[a][b];
> >> +	else if (affinity_form == FORM0_AFFINITY)
> >>  		return ((a == b) ? LOCAL_DISTANCE : REMOTE_DISTANCE);
> >>  
> >>  	for (i = 0; i < max_associativity_domain_index; i++) {
> >
> > Hmm.. couldn't we simplify this whole __node_distance function, if we
> > just update numa_distance_table[][] appropriately for Form0 and Form1
> > as well?
> 
> IIUC, you are suggesting that we look at using numa_distance_table[a][b]
> for FORM1_AFFINITY as well? Can I do that as part of a separate patch?

Ok, that's reasonable.

> >
> >> @@ -216,27 +262,6 @@ int __node_distance(int a, int b)
> >>  }
> >>  EXPORT_SYMBOL(__node_distance);
> >>  
> >> -/*
> >> - * Returns nid in the range [0..nr_node_ids], or -1 if no useful NUMA
> >> - * info is found.
> >> - */
> >> -static int associativity_to_nid(const __be32 *associativity)
> >> -{
> >> -	int nid = NUMA_NO_NODE;
> >> -
> >> -	if (!numa_enabled)
> >> -		goto out;
> >> -
> >> -	if (of_read_number(associativity, 1) >= primary_domain_index)
> >> -		nid = of_read_number(&associativity[primary_domain_index], 1);
> >> -
> >> -	/* POWER4 LPAR uses 0xffff as invalid node */
> >> -	if (nid == 0xffff || nid >= nr_node_ids)
> >> -		nid = NUMA_NO_NODE;
> >> -out:
> >> -	return nid;
> >> -}
> >> -
> >>  /* Returns the nid associated with the given device tree node,
> >>   * or -1 if not found.
> >>   */
> >> @@ -305,12 +330,84 @@ static void initialize_form1_numa_distance(struct device_node *node)
> >>   */
> >>  void update_numa_distance(struct device_node *node)
> >>  {
> >> +	int nid;
> >> +
> >>  	if (affinity_form == FORM0_AFFINITY)
> >>  		return;
> >>  	else if (affinity_form == FORM1_AFFINITY) {
> >>  		initialize_form1_numa_distance(node);
> >>  		return;
> >>  	}
> >> +
> >> +	/* FORM2 affinity  */
> >> +	nid = of_node_to_nid_single(node);
> >> +	if (nid == NUMA_NO_NODE)
> >> +		return;
> >> +
> >> +	/*
> >> +	 * With FORM2 we expect NUMA distance of all possible NUMA
> >> +	 * nodes to be provided during boot.
> >> +	 */
> >> +	WARN(numa_distance_table[nid][nid] == -1,
> >> +	     "NUMA distance details for node %d not provided\n", nid);
> >> +}
> >> +
> >> +/*
> >> + * ibm,numa-lookup-index-table= {N, domainid1, domainid2, ..... domainidN}
> >> + * ibm,numa-distance-table = { N, 1, 2, 4, 5, 1, 6, .... N elements}
> >> + */
> >> +static void initialize_form2_numa_distance_lookup_table(struct device_node *root)
> >> +{
> >> +	int i, j;
> >> +	const __u8 *numa_dist_table;
> >> +	const __be32 *numa_lookup_index;
> >> +	int numa_dist_table_length;
> >> +	int max_numa_index, distance_index;
> >> +
> >> +	numa_lookup_index = of_get_property(root, "ibm,numa-lookup-index-table", NULL);
> >> +	max_numa_index = of_read_number(&numa_lookup_index[0], 1);
> >> +
> >> +	/* first element of the array is the size and is encode-int */
> >> +	numa_dist_table = of_get_property(root, "ibm,numa-distance-table", NULL);
> >> +	numa_dist_table_length = of_read_number((const __be32 *)&numa_dist_table[0], 1);
> >> +	/* Skip the size which is encoded int */
> >> +	numa_dist_table += sizeof(__be32);
> >> +
> >> +	pr_debug("numa_dist_table_len = %d, numa_dist_indexes_len = %d\n",
> >> +		 numa_dist_table_length, max_numa_index);
> >> +
> >> +	for (i = 0; i < max_numa_index; i++)
> >> +		/* +1 skip the max_numa_index in the property */
> >> +		numa_id_index_table[i] = of_read_number(&numa_lookup_index[i + 1], 1);
> >> +
> >> +
> >> +	if (numa_dist_table_length != max_numa_index * max_numa_index) {
> >> +
> >> +		WARN(1, "Wrong NUMA distance information\n");
> >> +		/* consider everybody else just remote. */
> >> +		for (i = 0;  i < max_numa_index; i++) {
> >> +			for (j = 0; j < max_numa_index; j++) {
> >> +				int nodeA = numa_id_index_table[i];
> >> +				int nodeB = numa_id_index_table[j];
> >> +
> >> +				if (nodeA == nodeB)
> >> +					numa_distance_table[nodeA][nodeB] = LOCAL_DISTANCE;
> >> +				else
> >> +					numa_distance_table[nodeA][nodeB] = REMOTE_DISTANCE;
> >> +			}
> >> +		}
> >> +	}
> >> +
> >> +	distance_index = 0;
> >> +	for (i = 0;  i < max_numa_index; i++) {
> >> +		for (j = 0; j < max_numa_index; j++) {
> >> +			int nodeA = numa_id_index_table[i];
> >> +			int nodeB = numa_id_index_table[j];
> >> +
> >> +			numa_distance_table[nodeA][nodeB] = numa_dist_table[distance_index++];
> >> +			pr_debug("dist[%d][%d]=%d ", nodeA, nodeB, numa_distance_table[nodeA][nodeB]);
> >> +		}
> >> +	}
> >>  }
> >>  
> >>  static int __init find_primary_domain_index(void)
> >> @@ -323,6 +420,9 @@ static int __init find_primary_domain_index(void)
> >>  	 */
> >>  	if (firmware_has_feature(FW_FEATURE_OPAL)) {
> >>  		affinity_form = FORM1_AFFINITY;
> >> +	} else if (firmware_has_feature(FW_FEATURE_FORM2_AFFINITY)) {
> >> +		dbg("Using form 2 affinity\n");
> >> +		affinity_form = FORM2_AFFINITY;
> >>  	} else if (firmware_has_feature(FW_FEATURE_FORM1_AFFINITY)) {
> >>  		dbg("Using form 1 affinity\n");
> >>  		affinity_form = FORM1_AFFINITY;
> >> @@ -367,8 +467,17 @@ static int __init find_primary_domain_index(void)
> >>  
> >>  		index = of_read_number(&distance_ref_points[1], 1);
> >>  	} else {
> >> +		/*
> >> +		 * Both FORM1 and FORM2 affinity find the primary domain details
> >> +		 * at the same offset.
> >> +		 */
> >>  		index = of_read_number(distance_ref_points, 1);
> >>  	}
> >> +	/*
> >> +	 * If it is FORM2 also initialize the distance table here.
> >> +	 */
> >> +	if (affinity_form == FORM2_AFFINITY)
> >> +		initialize_form2_numa_distance_lookup_table(root);
> >
> > Ew.  Calling a function called "find_primary_domain_index" to also
> > initialize the main distance table is needlessly counterintuitive.
> > Move this call to parse_numa_properties().
> 
> I ended up doing it here because 'root' is already fetched here. But I
> agree it is confusing. I will move the fetching of root into
> initialize_form2_numa_distance_lookup_table() and move the call out of
> the primary_index lookup.

Ok.  This is not a hot path, so looking up root twice isn't really a
big deal.

> 
> modified   arch/powerpc/mm/numa.c
> @@ -355,14 +355,22 @@ void update_numa_distance(struct device_node *node)
>   * ibm,numa-lookup-index-table= {N, domainid1, domainid2, ..... domainidN}
>   * ibm,numa-distance-table = { N, 1, 2, 4, 5, 1, 6, .... N elements}
>   */
> -static void initialize_form2_numa_distance_lookup_table(struct device_node *root)
> +static void initialize_form2_numa_distance_lookup_table(void)
>  {
>  	int i, j;
> +	struct device_node *root;
>  	const __u8 *numa_dist_table;
>  	const __be32 *numa_lookup_index;
>  	int numa_dist_table_length;
>  	int max_numa_index, distance_index;
>  
> +	if (firmware_has_feature(FW_FEATURE_OPAL))
> +		root = of_find_node_by_path("/ibm,opal");
> +	else
> +		root = of_find_node_by_path("/rtas");
> +	if (!root)
> +		root = of_find_node_by_path("/");
> +
>  	numa_lookup_index = of_get_property(root, "ibm,numa-lookup-index-table", NULL);
>  	max_numa_index = of_read_number(&numa_lookup_index[0], 1);
>  
> @@ -407,6 +415,7 @@ static void initialize_form2_numa_distance_lookup_table(struct device_node *root
>  			pr_debug("dist[%d][%d]=%d ", nodeA, nodeB, numa_distance_table[nodeA][nodeB]);
>  		}
>  	}
> +	of_node_put(root);
>  }
>  
>  static int __init find_primary_domain_index(void)
> @@ -472,12 +481,6 @@ static int __init find_primary_domain_index(void)
>  		 */
>  		index = of_read_number(distance_ref_points, 1);
>  	}
> -	/*
> -	 * If it is FORM2 also initialize the distance table here.
> -	 */
> -	if (affinity_form == FORM2_AFFINITY)
> -		initialize_form2_numa_distance_lookup_table(root);
> -
>  	/*
>  	 * Warn and cap if the hardware supports more than
>  	 * MAX_DISTANCE_REF_POINTS domains.
> @@ -916,6 +919,12 @@ static int __init parse_numa_properties(void)
>  
>  	dbg("NUMA associativity depth for CPU/Memory: %d\n", primary_domain_index);
>  
> +	/*
> +	 * If it is FORM2 also initialize the distance table here.
> +	 */
> +	if (affinity_form == FORM2_AFFINITY)
> +		initialize_form2_numa_distance_lookup_table();
> +
>  	/*
>  	 * Even though we connect cpus to numa domains later in SMP
>  	 * init, we need to know the node ids now. This is because
> 
> -aneesh
>

Patch

diff --git a/Documentation/powerpc/associativity.rst b/Documentation/powerpc/associativity.rst
new file mode 100644
index 000000000000..31cc7da2c7a6
--- /dev/null
+++ b/Documentation/powerpc/associativity.rst
@@ -0,0 +1,103 @@ 
+============================
+NUMA resource associativity
+=============================
+
+Associativity represents the groupings of the various platform resources into
+domains of substantially similar mean performance relative to resources outside
+of that domain. Resources subsets of a given domain that exhibit better
+performance relative to each other than relative to other resources subsets
+are represented as being members of a sub-grouping domain. This performance
+characteristic is presented in terms of NUMA node distance within the Linux kernel.
+From the platform view, these groups are also referred to as domains.
+
+PAPR interface currently supports different ways of communicating these resource
+grouping details to the OS. These are referred to as Form 0, Form 1 and Form2
+associativity grouping. Form 0 is the older format and is now considered deprecated.
+
+Hypervisor indicates the type/form of associativity used via "ibm,architecture-vec-5 property".
+Bit 0 of byte 5 in the "ibm,architecture-vec-5" property indicates usage of Form 0 or Form 1.
+A value of 1 indicates the usage of Form 1 associativity. For Form 2 associativity
+bit 2 of byte 5 in the "ibm,architecture-vec-5" property is used.
+
+Form 0
+-----
+Form 0 associativity supports only two NUMA distances (LOCAL and REMOTE).
+
+Form 1
+-----
+With Form 1 a combination of ibm,associativity-reference-points, and ibm,associativity
+device tree properties are used to determine the NUMA distance between resource groups/domains.
+
+The “ibm,associativity” property contains a list of one or more numbers (domainID)
+representing the resource’s platform grouping domains.
+
+The “ibm,associativity-reference-points” property contains a list of one or more numbers
+(domainID index) that represents the 1 based ordinal in the associativity lists.
+The list of domainID indexes represents an increasing hierarchy of resource grouping.
+
+ex:
+{ primary domainID index, secondary domainID index, tertiary domainID index.. }
+
+Linux kernel uses the domainID at the primary domainID index as the NUMA node id.
+Linux kernel computes NUMA distance between two domains by recursively comparing
+if they belong to the same higher-level domains. For mismatch at every higher
+level of the resource group, the kernel doubles the NUMA distance between the
+comparing domains.
+
+Form 2
+-------
+Form 2 associativity format adds separate device tree properties representing NUMA node distance
+thereby making the node distance computation flexible. Form 2 also allows flexible primary
+domain numbering. With numa distance computation now detached from the index value in
+"ibm,associativity-reference-points" property, Form 2 allows a large number of primary domain
+ids at the same domainID index representing resource groups of different performance/latency
+characteristics.
+
+Hypervisor indicates the usage of FORM2 associativity using bit 2 of byte 5 in the
+"ibm,architecture-vec-5" property.
+
+The "ibm,numa-lookup-index-table" property contains a list of one or more numbers representing
+the domainIDs present in the system. The offset of the domainID in this property is
+used as an index while computing NUMA distance information via "ibm,numa-distance-table".
+
+prop-encoded-array: The number N of the domainIDs encoded as with encode-int, followed by
+N domainIDs encoded as with encode-int
+
+For ex:
+"ibm,numa-lookup-index-table" =  {4, 0, 8, 250, 252}. The offset of domainID 8 (2) is used when
+computing the distance of domain 8 from other domains present in the system. For the rest of
+this document, this offset will be referred to as domain distance offset.
+
+"ibm,numa-distance-table" property contains a list of one or more numbers representing the NUMA
+distance between resource groups/domains present in the system.
+
+prop-encoded-array: The number N of the distance values encoded as with encode-int, followed by
+N distance values encoded as with encode-bytes. The max distance value we could encode is 255.
+The number N must be equal to the square of m where m is the number of domainIDs in the
+numa-lookup-index-table.
+
+For ex:
+ibm,numa-lookup-index-table =  {3, 0, 8, 40}
+ibm,numa-distance-table     =  {9, 10, 20, 80, 20, 10, 160, 80, 160, 10}
+
+  | 0    8   40
+--|------------
+  |
+0 | 10   20  80
+  |
+8 | 20   10  160
+  |
+40| 80   160  10
+
+A possible "ibm,associativity" property for resources in node 0, 8 and 40
+
+{ 3, 6, 7, 0 }
+{ 3, 6, 9, 8 }
+{ 3, 6, 7, 40}
+
+With "ibm,associativity-reference-points"  { 0x3 }
+
+"ibm,numa-lookup-index-table" allows a compact representation of the distance matrix.
+Since domainIDs can be sparse, a distance matrix indexed directly by domainID would
+also be sparse. Indexing the matrix by the offset into "ibm,numa-lookup-index-table"
+keeps the distance information dense.
diff --git a/arch/powerpc/include/asm/firmware.h b/arch/powerpc/include/asm/firmware.h
index 60b631161360..97a3bd9ffeb9 100644
--- a/arch/powerpc/include/asm/firmware.h
+++ b/arch/powerpc/include/asm/firmware.h
@@ -53,6 +53,7 @@ 
 #define FW_FEATURE_ULTRAVISOR	ASM_CONST(0x0000004000000000)
 #define FW_FEATURE_STUFF_TCE	ASM_CONST(0x0000008000000000)
 #define FW_FEATURE_RPT_INVALIDATE ASM_CONST(0x0000010000000000)
+#define FW_FEATURE_FORM2_AFFINITY ASM_CONST(0x0000020000000000)
 
 #ifndef __ASSEMBLY__
 
@@ -73,7 +74,7 @@  enum {
 		FW_FEATURE_HPT_RESIZE | FW_FEATURE_DRMEM_V2 |
 		FW_FEATURE_DRC_INFO | FW_FEATURE_BLOCK_REMOVE |
 		FW_FEATURE_PAPR_SCM | FW_FEATURE_ULTRAVISOR |
-		FW_FEATURE_RPT_INVALIDATE,
+		FW_FEATURE_RPT_INVALIDATE | FW_FEATURE_FORM2_AFFINITY,
 	FW_FEATURE_PSERIES_ALWAYS = 0,
 	FW_FEATURE_POWERNV_POSSIBLE = FW_FEATURE_OPAL | FW_FEATURE_ULTRAVISOR,
 	FW_FEATURE_POWERNV_ALWAYS = 0,
diff --git a/arch/powerpc/include/asm/prom.h b/arch/powerpc/include/asm/prom.h
index df9fec9d232c..5c80152e8f18 100644
--- a/arch/powerpc/include/asm/prom.h
+++ b/arch/powerpc/include/asm/prom.h
@@ -149,6 +149,7 @@  extern int of_read_drc_info_cell(struct property **prop,
 #define OV5_XCMO		0x0440	/* Page Coalescing */
 #define OV5_FORM1_AFFINITY	0x0580	/* FORM1 NUMA affinity */
 #define OV5_PRRN		0x0540	/* Platform Resource Reassignment */
+#define OV5_FORM2_AFFINITY	0x0520	/* Form2 NUMA affinity */
 #define OV5_HP_EVT		0x0604	/* Hot Plug Event support */
 #define OV5_RESIZE_HPT		0x0601	/* Hash Page Table resizing */
 #define OV5_PFO_HW_RNG		0x1180	/* PFO Random Number Generator */
diff --git a/arch/powerpc/kernel/prom_init.c b/arch/powerpc/kernel/prom_init.c
index 5d9ea059594f..c483df6c9393 100644
--- a/arch/powerpc/kernel/prom_init.c
+++ b/arch/powerpc/kernel/prom_init.c
@@ -1069,7 +1069,8 @@  static const struct ibm_arch_vec ibm_architecture_vec_template __initconst = {
 #else
 		0,
 #endif
-		.associativity = OV5_FEAT(OV5_FORM1_AFFINITY) | OV5_FEAT(OV5_PRRN),
+		.associativity = OV5_FEAT(OV5_FORM1_AFFINITY) | OV5_FEAT(OV5_PRRN) |
+		OV5_FEAT(OV5_FORM2_AFFINITY),
 		.bin_opts = OV5_FEAT(OV5_RESIZE_HPT) | OV5_FEAT(OV5_HP_EVT),
 		.micro_checkpoint = 0,
 		.reserved0 = 0,
diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index c6293037a103..c68846fc9550 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -56,12 +56,17 @@  static int n_mem_addr_cells, n_mem_size_cells;
 
 #define FORM0_AFFINITY 0
 #define FORM1_AFFINITY 1
+#define FORM2_AFFINITY 2
 static int affinity_form;
 
 #define MAX_DISTANCE_REF_POINTS 4
 static int max_associativity_domain_index;
 static const __be32 *distance_ref_points;
 static int distance_lookup_table[MAX_NUMNODES][MAX_DISTANCE_REF_POINTS];
+static int numa_distance_table[MAX_NUMNODES][MAX_NUMNODES] = {
+	[0 ... MAX_NUMNODES - 1] = { [0 ... MAX_NUMNODES - 1] = -1 }
+};
+static int numa_id_index_table[MAX_NUMNODES] = { [0 ... MAX_NUMNODES - 1] = NUMA_NO_NODE };
 
 /*
  * Allocate node_to_cpumask_map based on number of available nodes
@@ -166,6 +171,44 @@  static void unmap_cpu_from_node(unsigned long cpu)
 }
 #endif /* CONFIG_HOTPLUG_CPU || CONFIG_PPC_SPLPAR */
 
+/*
+ * Returns nid in the range [0..nr_node_ids], or -1 if no useful NUMA
+ * info is found.
+ */
+static int associativity_to_nid(const __be32 *associativity)
+{
+	int nid = NUMA_NO_NODE;
+
+	if (!numa_enabled)
+		goto out;
+
+	if (of_read_number(associativity, 1) >= primary_domain_index)
+		nid = of_read_number(&associativity[primary_domain_index], 1);
+
+	/* POWER4 LPAR uses 0xffff as invalid node */
+	if (nid == 0xffff || nid >= nr_node_ids)
+		nid = NUMA_NO_NODE;
+out:
+	return nid;
+}
+
+static int __cpu_form2_relative_distance(__be32 *cpu1_assoc, __be32 *cpu2_assoc)
+{
+	int dist;
+	int node1, node2;
+
+	node1 = associativity_to_nid(cpu1_assoc);
+	node2 = associativity_to_nid(cpu2_assoc);
+
+	dist = numa_distance_table[node1][node2];
+	if (dist <= LOCAL_DISTANCE)
+		return 0;
+	else if (dist <= REMOTE_DISTANCE)
+		return 1;
+	else
+		return 2;
+}
+
 static int __cpu_form1_relative_distance(__be32 *cpu1_assoc, __be32 *cpu2_assoc)
 {
 	int dist = 0;
@@ -186,8 +229,9 @@  int cpu_relative_distance(__be32 *cpu1_assoc, __be32 *cpu2_assoc)
 {
 	/* We should not get called with FORM0 */
 	VM_WARN_ON(affinity_form == FORM0_AFFINITY);
-
-	return __cpu_form1_relative_distance(cpu1_assoc, cpu2_assoc);
+	if (affinity_form == FORM1_AFFINITY)
+		return __cpu_form1_relative_distance(cpu1_assoc, cpu2_assoc);
+	return __cpu_form2_relative_distance(cpu1_assoc, cpu2_assoc);
 }
 
 /* must hold reference to node during call */
@@ -201,7 +245,9 @@  int __node_distance(int a, int b)
 	int i;
 	int distance = LOCAL_DISTANCE;
 
-	if (affinity_form == FORM0_AFFINITY)
+	if (affinity_form == FORM2_AFFINITY)
+		return numa_distance_table[a][b];
+	else if (affinity_form == FORM0_AFFINITY)
 		return ((a == b) ? LOCAL_DISTANCE : REMOTE_DISTANCE);
 
 	for (i = 0; i < max_associativity_domain_index; i++) {
@@ -216,27 +262,6 @@  int __node_distance(int a, int b)
 }
 EXPORT_SYMBOL(__node_distance);
 
-/*
- * Returns nid in the range [0..nr_node_ids], or -1 if no useful NUMA
- * info is found.
- */
-static int associativity_to_nid(const __be32 *associativity)
-{
-	int nid = NUMA_NO_NODE;
-
-	if (!numa_enabled)
-		goto out;
-
-	if (of_read_number(associativity, 1) >= primary_domain_index)
-		nid = of_read_number(&associativity[primary_domain_index], 1);
-
-	/* POWER4 LPAR uses 0xffff as invalid node */
-	if (nid == 0xffff || nid >= nr_node_ids)
-		nid = NUMA_NO_NODE;
-out:
-	return nid;
-}
-
 /* Returns the nid associated with the given device tree node,
  * or -1 if not found.
  */
@@ -305,12 +330,84 @@  static void initialize_form1_numa_distance(struct device_node *node)
  */
 void update_numa_distance(struct device_node *node)
 {
+	int nid;
+
 	if (affinity_form == FORM0_AFFINITY)
 		return;
 	else if (affinity_form == FORM1_AFFINITY) {
 		initialize_form1_numa_distance(node);
 		return;
 	}
+
+	/* FORM2 affinity  */
+	nid = of_node_to_nid_single(node);
+	if (nid == NUMA_NO_NODE)
+		return;
+
+	/*
+	 * With FORM2 we expect NUMA distance of all possible NUMA
+	 * nodes to be provided during boot.
+	 */
+	WARN(numa_distance_table[nid][nid] == -1,
+	     "NUMA distance details for node %d not provided\n", nid);
+}
+
+/*
+ * ibm,numa-lookup-index-table= {N, domainid1, domainid2, ..... domainidN}
+ * ibm,numa-distance-table = { N, 1, 2, 4, 5, 1, 6, .... N elements}
+ */
+static void initialize_form2_numa_distance_lookup_table(struct device_node *root)
+{
+	int i, j;
+	const __u8 *numa_dist_table;
+	const __be32 *numa_lookup_index;
+	int numa_dist_table_length;
+	int max_numa_index, distance_index;
+
+	numa_lookup_index = of_get_property(root, "ibm,numa-lookup-index-table", NULL);
+	max_numa_index = of_read_number(&numa_lookup_index[0], 1);
+
+	/* first element of the array is the size and is encode-int */
+	numa_dist_table = of_get_property(root, "ibm,numa-distance-table", NULL);
+	numa_dist_table_length = of_read_number((const __be32 *)&numa_dist_table[0], 1);
+	/* Skip the size which is encoded int */
+	numa_dist_table += sizeof(__be32);
+
+	pr_debug("numa_dist_table_len = %d, numa_dist_indexes_len = %d\n",
+		 numa_dist_table_length, max_numa_index);
+
+	for (i = 0; i < max_numa_index; i++)
+		/* +1 skip the max_numa_index in the property */
+		numa_id_index_table[i] = of_read_number(&numa_lookup_index[i + 1], 1);
+
+
+	if (numa_dist_table_length != max_numa_index * max_numa_index) {
+		WARN(1, "Wrong NUMA distance information\n");
+		/* consider everybody else just remote and skip the malformed table. */
+		for (i = 0;  i < max_numa_index; i++) {
+			for (j = 0; j < max_numa_index; j++) {
+				int nodeA = numa_id_index_table[i];
+				int nodeB = numa_id_index_table[j];
+
+				if (nodeA == nodeB)
+					numa_distance_table[nodeA][nodeB] = LOCAL_DISTANCE;
+				else
+					numa_distance_table[nodeA][nodeB] = REMOTE_DISTANCE;
+			}
+		}
+		return;
+	}
+
+	distance_index = 0;
+	for (i = 0;  i < max_numa_index; i++) {
+		for (j = 0; j < max_numa_index; j++) {
+			int nodeA = numa_id_index_table[i];
+			int nodeB = numa_id_index_table[j];
+
+			numa_distance_table[nodeA][nodeB] = numa_dist_table[distance_index++];
+			pr_debug("dist[%d][%d]=%d ", nodeA, nodeB, numa_distance_table[nodeA][nodeB]);
+		}
+	}
 }
 
 static int __init find_primary_domain_index(void)
@@ -323,6 +420,9 @@  static int __init find_primary_domain_index(void)
 	 */
 	if (firmware_has_feature(FW_FEATURE_OPAL)) {
 		affinity_form = FORM1_AFFINITY;
+	} else if (firmware_has_feature(FW_FEATURE_FORM2_AFFINITY)) {
+		dbg("Using form 2 affinity\n");
+		affinity_form = FORM2_AFFINITY;
 	} else if (firmware_has_feature(FW_FEATURE_FORM1_AFFINITY)) {
 		dbg("Using form 1 affinity\n");
 		affinity_form = FORM1_AFFINITY;
@@ -367,8 +467,17 @@  static int __init find_primary_domain_index(void)
 
 		index = of_read_number(&distance_ref_points[1], 1);
 	} else {
+		/*
+		 * Both FORM1 and FORM2 affinity find the primary domain details
+		 * at the same offset.
+		 */
 		index = of_read_number(distance_ref_points, 1);
 	}
+	/*
+	 * If it is FORM2 also initialize the distance table here.
+	 */
+	if (affinity_form == FORM2_AFFINITY)
+		initialize_form2_numa_distance_lookup_table(root);
 
 	/*
 	 * Warn and cap if the hardware supports more than
diff --git a/arch/powerpc/platforms/pseries/firmware.c b/arch/powerpc/platforms/pseries/firmware.c
index 5d4c2bc20bba..f162156b7b68 100644
--- a/arch/powerpc/platforms/pseries/firmware.c
+++ b/arch/powerpc/platforms/pseries/firmware.c
@@ -123,6 +123,7 @@  vec5_fw_features_table[] = {
 	{FW_FEATURE_PRRN,		OV5_PRRN},
 	{FW_FEATURE_DRMEM_V2,		OV5_DRMEM_V2},
 	{FW_FEATURE_DRC_INFO,		OV5_DRC_INFO},
+	{FW_FEATURE_FORM2_AFFINITY,	OV5_FORM2_AFFINITY},
 };
 
 static void __init fw_vec5_feature_init(const char *vec5, unsigned long len)
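
For readers following the Form 2 example in the documentation hunk (lookup table {3, 0, 8, 40} and distance table {9, 10, 20, 80, ...}), here is a minimal userspace sketch of how the dense distance list expands into a matrix indexed by domainID. It is not the kernel code from this patch: MAX_NODES, DIST_UNSET, dist[][] and form2_fill_distances() are hypothetical names chosen for illustration, the leading count N is assumed to have been stripped from both arrays already, and the distance bytes are assumed to be in row-major order, as the example matrix suggests.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_NODES  64     /* hypothetical cap, stands in for MAX_NUMNODES */
#define DIST_UNSET (-1)   /* marks domainID pairs with no distance provided */

/* Sparse matrix indexed directly by domainID. */
static int dist[MAX_NODES][MAX_NODES];

/*
 * lookup: the n domainIDs, leading count already stripped
 * table:  n * n distance bytes in row-major order, leading count stripped
 *
 * Returns 0 on success, -1 if a domainID does not fit in dist[][].
 */
static int form2_fill_distances(const uint32_t *lookup, size_t n,
				const uint8_t *table)
{
	size_t i, j;

	/* Start with every pair unset so missing nodes are detectable. */
	for (i = 0; i < MAX_NODES; i++)
		for (j = 0; j < MAX_NODES; j++)
			dist[i][j] = DIST_UNSET;

	/* Reject domainIDs that would index outside the matrix. */
	for (i = 0; i < n; i++)
		if (lookup[i] >= MAX_NODES)
			return -1;

	/* Entry (i, j) of the dense table is the distance between
	 * the i-th and j-th domainIDs of the lookup table. */
	for (i = 0; i < n; i++)
		for (j = 0; j < n; j++)
			dist[lookup[i]][lookup[j]] = table[i * n + j];

	return 0;
}
```

With lookup = {0, 8, 40} and the nine distance bytes from the documentation example, dist[8][40] ends up as 160 and dist[40][0] as 80, matching the matrix shown there, while pairs involving domainIDs absent from the lookup table stay at DIST_UNSET.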