diff mbox

[RFC,v2,2/4] Documentation: arm64/arm: dt bindings for numa.

Message ID 1416605010-10442-3-git-send-email-ganapatrao.kulkarni@caviumnetworks.com
State Superseded, archived
Headers show

Commit Message

Ganapatrao Kulkarni Nov. 21, 2014, 9:23 p.m. UTC
DT bindings for numa map for memory, cores to node and
proximity distance matrix of nodes to each other.

Signed-off-by: Ganapatrao Kulkarni <ganapatrao.kulkarni@caviumnetworks.com>
---
 Documentation/devicetree/bindings/arm/numa.txt | 103 +++++++++++++++++++++++++
 1 file changed, 103 insertions(+)
 create mode 100644 Documentation/devicetree/bindings/arm/numa.txt

Comments

Shannon Zhao Nov. 25, 2014, 3:55 a.m. UTC | #1
Hi,

On 2014/11/22 5:23, Ganapatrao Kulkarni wrote:
> DT bindings for numa map for memory, cores to node and
> proximity distance matrix of nodes to each other.
> 
> Signed-off-by: Ganapatrao Kulkarni <ganapatrao.kulkarni@caviumnetworks.com>
> ---
>  Documentation/devicetree/bindings/arm/numa.txt | 103 +++++++++++++++++++++++++
>  1 file changed, 103 insertions(+)
>  create mode 100644 Documentation/devicetree/bindings/arm/numa.txt
> 
> diff --git a/Documentation/devicetree/bindings/arm/numa.txt b/Documentation/devicetree/bindings/arm/numa.txt
> new file mode 100644
> index 0000000..ec6bf2d
> --- /dev/null
> +++ b/Documentation/devicetree/bindings/arm/numa.txt
> @@ -0,0 +1,103 @@
> +==============================================================================
> +NUMA binding description.
> +==============================================================================
> +
> +==============================================================================
> +1 - Introduction
> +==============================================================================
> +
> +Systems employing a Non Uniform Memory Access (NUMA) architecture contain
> +collections of hardware resources including processors, memory, and I/O buses,
> +that comprise what is commonly known as a ??UMA node?? Processor
> +accesses to memory within the local NUMA node is
> +generally faster than processor accesses to memory outside of the local
> +NUMA node. DT defines interfaces that allow the platform to convey NUMA node
> +topology information to OS.
> +
> +==============================================================================
> +2 - numa-map node
> +==============================================================================
> +
> +DT Binding for NUMA can be defined for memory and CPUs to map them to
> +respective NUMA nodes.
> +
> +The DT binding can defined using numa-map node.
> +The numa-map will have following properties to define NUMA topology.
> +
> +- mem-map:	This property defines the association between a range of
> +		memory and the proximity domain/numa node to which it belongs.
> +
> +note: memory range address is passed using either memory node of
> +DT or UEFI system table and should match to the address defined in mem-map.
> +
> +- cpu-map:	This property defines the association of range of processors
> +		(range of cpu ids) and the proximity domain to which
> +		the processor belongs.
> +
> +- node-matrix:	This table provides a matrix that describes the relative
> +		distance (memory latency) between all System Localities.
> +		The value of each Entry[i j distance] in node-matrix table,
> +		where i represents a row of a matrix and j represents a
> +		column of a matrix, indicates the relative distances
> +		from Proximity Domain/Numa node i to every other
> +		node j in the system (including itself).
> +
> +The numa-map node must contain the appropriate #address-cells,
> +#size-cells and #node-count properties.
> +
> +
> +==============================================================================
> +4 - Example dts
> +==============================================================================
> +
> +Example 1: 2 Node system each having 8 CPUs and a Memory.
> +
> +	numa-map {
> +		#address-cells = <2>;
> +		#size-cells = <1>;
> +		#node-count = <2>;
> +		mem-map =  <0x0 0x00000000 0>,
> +		           <0x100 0x00000000 1>;
> +
> +		cpu-map = <0 7 0>,
> +			  <8 15 1>;

The cpu range is continuous here. But if there is a situation like below:

0 2 4 6 belong to node 0
1 3 5 7 belong to node 1

This case is very common on X86. I don't know the real situation of arm as
I don't have a hardware with 2 nodes.

How can we generate a DTS about this situation? like below? Can be parsed?

		cpu-map = <0 2 4 6 0>,
			  <1 3 5 7 1>;

Thanks,
Shannon

> +
> +		node-matrix = <0 0 10>,
> +			      <0 1 20>,
> +			      <1 0 20>,
> +			      <1 1 10>;
> +	};
> +
> +Example 2: 4 Node system each having 4 CPUs and a Memory.
> +
> +	numa-map {
> +		#address-cells = <2>;
> +		#size-cells = <1>;
> +		#node-count = <2>;
> +		mem-map =  <0x0 0x00000000 0>,
> +		           <0x100 0x00000000 1>,
> +		           <0x200 0x00000000 2>,
> +		           <0x300 0x00000000 3>;
> +
> +		cpu-map = <0 7 0>,
> +			  <8 15 1>,
> +			  <16 23 2>,
> +			  <24 31 3>;
> +
> +		node-matrix = <0 0 10>,
> +			      <0 1 20>,
> +			      <0 2 20>,
> +			      <0 3 20>,
> +			      <1 0 20>,
> +			      <1 1 10>,
> +			      <1 2 20>,
> +			      <1 3 20>,
> +			      <2 0 20>,
> +			      <2 1 20>,
> +			      <2 2 10>,
> +			      <2 3 20>,
> +			      <3 0 20>,
> +			      <3 1 20>,
> +			      <3 2 20>,
> +			      <3 3 10>;
> +	};
> 
> 
> 
> _______________________________________________
> linux-arm-kernel mailing list
> linux-arm-kernel@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
>
Hanjun Guo Nov. 25, 2014, 9:42 a.m. UTC | #2
On 2014-11-25 11:55, Shannon Zhao wrote:
> Hi,
> 
> On 2014/11/22 5:23, Ganapatrao Kulkarni wrote:
[...]
>> +==============================================================================
>> +4 - Example dts
>> +==============================================================================
>> +
>> +Example 1: 2 Node system each having 8 CPUs and a Memory.
>> +
>> +	numa-map {
>> +		#address-cells = <2>;
>> +		#size-cells = <1>;
>> +		#node-count = <2>;
>> +		mem-map =  <0x0 0x00000000 0>,
>> +		           <0x100 0x00000000 1>;
>> +
>> +		cpu-map = <0 7 0>,
>> +			  <8 15 1>;
> 
> The cpu range is continuous here. But if there is a situation like below:
> 
> 0 2 4 6 belong to node 0
> 1 3 5 7 belong to node 1
> 
> This case is very common on X86. I don't know the real situation of arm as
> I don't have a hardware with 2 nodes.
> 
> How can we generate a DTS about this situation? like below? Can be parsed?
> 
> 		cpu-map = <0 2 4 6 0>,
> 			  <1 3 5 7 1>;

I think the binding proposed here can not cover your needs, and I think this
binding is not suitable, there are some reasons.

 - CPU logical ID is allocated by OS, and it depends on the order of CPU node
   in the device tree, so it may be in a clean order like this patch proposed,
   or it will like the order Shannon pointed out.

 - Since CPU logical ID is allocated by OS, DTS file will not know these
   numbers.

So the problem behind this is the mappings between CPUs and NUMA nodes,
there is already mapping for CPU hardware ID (MPIDR) and CPU logical ID,
and MPIDR will be not changed, why not using MPIDR for the mapping of
NUMA node and CPU? then the mappings will be:

CPU logical ID <------> CPU MPIDR <-----> NUMA node ID <-----> proximity domain
(allocated by OS)      (constant)       (allocated by OS)

Thanks
Hanjun
--
To unsubscribe from this list: send the line "unsubscribe devicetree" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Arnd Bergmann Nov. 25, 2014, 11:02 a.m. UTC | #3
On Tuesday 25 November 2014 17:42:44 Hanjun Guo wrote:
> On 2014-11-25 11:55, Shannon Zhao wrote:
> > Hi,
> > 
> > On 2014/11/22 5:23, Ganapatrao Kulkarni wrote:
> [...]
> >> +==============================================================================
> >> +4 - Example dts
> >> +==============================================================================
> >> +
> >> +Example 1: 2 Node system each having 8 CPUs and a Memory.
> >> +
> >> +    numa-map {
> >> +            #address-cells = <2>;
> >> +            #size-cells = <1>;
> >> +            #node-count = <2>;
> >> +            mem-map =  <0x0 0x00000000 0>,
> >> +                       <0x100 0x00000000 1>;
> >> +
> >> +            cpu-map = <0 7 0>,
> >> +                      <8 15 1>;
> > 
> > The cpu range is continuous here. But if there is a situation like below:
> > 
> > 0 2 4 6 belong to node 0
> > 1 3 5 7 belong to node 1
> > 
> > This case is very common on X86. I don't know the real situation of arm as
> > I don't have a hardware with 2 nodes.
> > 
> > How can we generate a DTS about this situation? like below? Can be parsed?
> > 
> >               cpu-map = <0 2 4 6 0>,
> >                         <1 3 5 7 1>;
> 
> I think the binding proposed here can not cover your needs, and I think this
> binding is not suitable, there are some reasons.
> 
>  - CPU logical ID is allocated by OS, and it depends on the order of CPU node
>    in the device tree, so it may be in a clean order like this patch proposed,
>    or it will like the order Shannon pointed out.
> 
>  - Since CPU logical ID is allocated by OS, DTS file will not know these
>    numbers.

Also:

- you cannot support hierarchical NUMA topology

- you cannot have CPU-less or memory-less nodes

- you cannot associate I/O devices with NUMA nodes, only memory and CPU

> So the problem behind this is the mappings between CPUs and NUMA nodes,
> there is already mapping for CPU hardware ID (MPIDR) and CPU logical ID,
> and MPIDR will be not changed, why not using MPIDR for the mapping of
> NUMA node and CPU? then the mappings will be:
> 
> CPU logical ID <------> CPU MPIDR <-----> NUMA node ID <-----> proximity domain
> (allocated by OS)      (constant)       (allocated by OS)

No, don't hardcode ARM specifics into a common binding either. I've looked
at the ibm,associativity properties again, and I think we should just use
those, they can cover all cases and are completely independent of the
architecture. We should probably discuss about the property name though,
as using the "ibm," prefix might not be the best idea.

	Arnd
--
To unsubscribe from this list: send the line "unsubscribe devicetree" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Ganapatrao Kulkarni Nov. 25, 2014, 1:15 p.m. UTC | #4
Hi Arnd,

On Tue, Nov 25, 2014 at 6:02 AM, Arnd Bergmann <arnd@arndb.de> wrote:
> On Tuesday 25 November 2014 17:42:44 Hanjun Guo wrote:
>> On 2014-11-25 11:55, Shannon Zhao wrote:
>> > Hi,
>> >
>> > On 2014/11/22 5:23, Ganapatrao Kulkarni wrote:
>> [...]
>> >> +==============================================================================
>> >> +4 - Example dts
>> >> +==============================================================================
>> >> +
>> >> +Example 1: 2 Node system each having 8 CPUs and a Memory.
>> >> +
>> >> +    numa-map {
>> >> +            #address-cells = <2>;
>> >> +            #size-cells = <1>;
>> >> +            #node-count = <2>;
>> >> +            mem-map =  <0x0 0x00000000 0>,
>> >> +                       <0x100 0x00000000 1>;
>> >> +
>> >> +            cpu-map = <0 7 0>,
>> >> +                      <8 15 1>;
>> >
>> > The cpu range is continuous here. But if there is a situation like below:
>> >
>> > 0 2 4 6 belong to node 0
>> > 1 3 5 7 belong to node 1
>> >
>> > This case is very common on X86. I don't know the real situation of arm as
>> > I don't have a hardware with 2 nodes.
>> >
>> > How can we generate a DTS about this situation? like below? Can be parsed?
>> >
>> >               cpu-map = <0 2 4 6 0>,
>> >                         <1 3 5 7 1>;
>>
>> I think the binding proposed here can not cover your needs, and I think this
>> binding is not suitable, there are some reasons.
>>
>>  - CPU logical ID is allocated by OS, and it depends on the order of CPU node
>>    in the device tree, so it may be in a clean order like this patch proposed,
>>    or it will like the order Shannon pointed out.
>>
>>  - Since CPU logical ID is allocated by OS, DTS file will not know these
>>    numbers.
>
> Also:
>
> - you cannot support hierarchical NUMA topology
>
> - you cannot have CPU-less or memory-less nodes
>
> - you cannot associate I/O devices with NUMA nodes, only memory and CPU
>
>> So the problem behind this is the mappings between CPUs and NUMA nodes,
>> there is already mapping for CPU hardware ID (MPIDR) and CPU logical ID,
>> and MPIDR will be not changed, why not using MPIDR for the mapping of
>> NUMA node and CPU? then the mappings will be:
>>
>> CPU logical ID <------> CPU MPIDR <-----> NUMA node ID <-----> proximity domain
>> (allocated by OS)      (constant)       (allocated by OS)
>
> No, don't hardcode ARM specifics into a common binding either. I've looked
> at the ibm,associativity properties again, and I think we should just use
> those, they can cover all cases and are completely independent of the
> architecture. We should probably discuss about the property name though,
> as using the "ibm," prefix might not be the best idea.
We have started with new proposal, since not got enough details how
ibm/ppc is managing the numa using dt.
there is no documentation and there is no power/PAPR spec for numa in
public domain and there are no single dt file in arch/powerpc which
describes the numa. if we get any one of these details, we can align
to powerpc implementation.
>
>         Arnd

thanks
ganapat
--
To unsubscribe from this list: send the line "unsubscribe devicetree" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Hanjun Guo Nov. 25, 2014, 2:54 p.m. UTC | #5
Hi Arnd,

On 2014年11月25日 19:02, Arnd Bergmann wrote:
> On Tuesday 25 November 2014 17:42:44 Hanjun Guo wrote:
>> On 2014-11-25 11:55, Shannon Zhao wrote:
>>> Hi,
>>>
>>> On 2014/11/22 5:23, Ganapatrao Kulkarni wrote:
>> [...]
>>>> +==============================================================================
>>>> +4 - Example dts
>>>> +==============================================================================
>>>> +
>>>> +Example 1: 2 Node system each having 8 CPUs and a Memory.
>>>> +
>>>> +    numa-map {
>>>> +            #address-cells = <2>;
>>>> +            #size-cells = <1>;
>>>> +            #node-count = <2>;
>>>> +            mem-map =  <0x0 0x00000000 0>,
>>>> +                       <0x100 0x00000000 1>;
>>>> +
>>>> +            cpu-map = <0 7 0>,
>>>> +                      <8 15 1>;
>>>
>>> The cpu range is continuous here. But if there is a situation like below:
>>>
>>> 0 2 4 6 belong to node 0
>>> 1 3 5 7 belong to node 1
>>>
>>> This case is very common on X86. I don't know the real situation of arm as
>>> I don't have a hardware with 2 nodes.
>>>
>>> How can we generate a DTS about this situation? like below? Can be parsed?
>>>
>>>                cpu-map = <0 2 4 6 0>,
>>>                          <1 3 5 7 1>;
>>
>> I think the binding proposed here can not cover your needs, and I think this
>> binding is not suitable, there are some reasons.
>>
>>   - CPU logical ID is allocated by OS, and it depends on the order of CPU node
>>     in the device tree, so it may be in a clean order like this patch proposed,
>>     or it will like the order Shannon pointed out.
>>
>>   - Since CPU logical ID is allocated by OS, DTS file will not know these
>>     numbers.
>
> Also:
>
> - you cannot support hierarchical NUMA topology
>
> - you cannot have CPU-less or memory-less nodes
>
> - you cannot associate I/O devices with NUMA nodes, only memory and CPU

Yes, I agree.

>
>> So the problem behind this is the mappings between CPUs and NUMA nodes,
>> there is already mapping for CPU hardware ID (MPIDR) and CPU logical ID,
>> and MPIDR will be not changed, why not using MPIDR for the mapping of
>> NUMA node and CPU? then the mappings will be:
>>
>> CPU logical ID <------> CPU MPIDR <-----> NUMA node ID <-----> proximity domain
>> (allocated by OS)      (constant)       (allocated by OS)
>
> No, don't hardcode ARM specifics into a common binding either. I've looked
> at the ibm,associativity properties again, and I think we should just use
> those, they can cover all cases and are completely independent of the
> architecture. We should probably discuss about the property name though,
> as using the "ibm," prefix might not be the best idea.

Is there any doc/code related to this? please give me some hints and I
will read that.

Thanks
Hanjun
--
To unsubscribe from this list: send the line "unsubscribe devicetree" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Arnd Bergmann Nov. 25, 2014, 7 p.m. UTC | #6
On Tuesday 25 November 2014 08:15:47 Ganapatrao Kulkarni wrote:
> > No, don't hardcode ARM specifics into a common binding either. I've looked
> > at the ibm,associativity properties again, and I think we should just use
> > those, they can cover all cases and are completely independent of the
> > architecture. We should probably discuss about the property name though,
> > as using the "ibm," prefix might not be the best idea.
>
> We have started with new proposal, since not got enough details how
> ibm/ppc is managing the numa using dt.
> there is no documentation and there is no power/PAPR spec for numa in
> public domain and there are no single dt file in arch/powerpc which
> describes the numa. if we get any one of these details, we can align
> to powerpc implementation.

Basically the idea is to have an "ibm,associativity" property in each
bus or device that is node specific, and this includes all CPUs and
memory nodes. The property contains an array of 32-bit integers that
count the resources. Take an example of a NUMA cluster of two machines
with four sockets and four cores each (32 cores total), a memory
channel on each socket and one PCI host per board that is connected
at equal speed to each socket on the board.

The ibm,associativity property in each PCI host, CPU or memory device
node consequently has an array of three (board, socket, core) integers:

	memory@0,0 {
		device_type = "memory";
		reg = <0x0 0x0  0x4 0x0;
		/* board 0, socket 0, no specific core */
		ibm,asssociativity = <0 0 0xffff>;
	};

	memory@4,0 {
		device_type = "memory";
		reg = <0x4 0x0  0x4 0x0>;
		/* board 0, socket 1, no specific core */
		ibm,asssociativity = <0 1 0xffff>; 
	};

	...

	memory@1c,0 {
		device_type = "memory";
		reg = <0x1c 0x0  0x4 0x0>;
		/* board 0, socket 7, no specific core */
		ibm,asssociativity = <1 7 0xffff>; 
	};

	cpus {
		#address-cells = <2>;
		#size-cells = <0>;
		cpu@0 {
			device_type = "cpu";
			reg = <0 0>;
			/* board 0, socket 0, core 0*/
			ibm,asssociativity = <0 0 0>; 
		};

		cpu@1 {
			device_type = "cpu";
			reg = <0 0>;
			/* board 0, socket 0, core 0*/
			ibm,asssociativity = <0 0 0>; 
		};

		...

		cpu@31 {
			device_type = "cpu";
			reg = <0 32>;
			/* board 1, socket 7, core 31*/
			ibm,asssociativity = <1 7 31>; 
		};
	};

	pci@100,0 {
		device_type = "pci";
		/* board 0 */
		ibm,associativity = <0 0xffff 0xffff>;
		...
	};

	pci@200,0 {
		device_type = "pci";
		/* board 1 */
		ibm,associativity = <1 0xffff 0xffff>;
		...
	};

	ibm,associativity-reference-points = <0 1>;

The "ibm,associativity-reference-points" property here indicates that index 2
of each array is the most important NUMA boundary for the particular system,
because the performance impact of allocating memory on the remote board 
is more significant than the impact of using memory on a remote socket of the
same board. Linux will consequently use the first field in the array as
the NUMA node ID. If the link between the boards however is relatively fast,
so you care mostly about allocating memory on the same socket, but going to
another board isn't much worse than going to another socket on the same
board, this would be

	ibm,associativity-reference-points = <1 0>;

so Linux would ignore the board ID and use the socket ID as the NUMA node
number. The same would apply if you have only one (otherwise identical
board, then you would get

	ibm,associativity-reference-points = <1>;

which means that index 0 is completely irrelevant for NUMA considerations
and you just care about the socket ID. In this case, devices on the PCI
bus would also not care about NUMA policy and just allocate buffers from
anywhere, while in original example Linux would allocate DMA buffers only
from the local board.

	Arnd
--
To unsubscribe from this list: send the line "unsubscribe devicetree" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Arnd Bergmann Nov. 25, 2014, 9:09 p.m. UTC | #7
On Tuesday 25 November 2014 20:00:42 Arnd Bergmann wrote:
> On Tuesday 25 November 2014 08:15:47 Ganapatrao Kulkarni wrote:
> > > No, don't hardcode ARM specifics into a common binding either. I've looked
> > > at the ibm,associativity properties again, and I think we should just use
> > > those, they can cover all cases and are completely independent of the
> > > architecture. We should probably discuss about the property name though,
> > > as using the "ibm," prefix might not be the best idea.
> >
> > We have started with new proposal, since not got enough details how
> > ibm/ppc is managing the numa using dt.
> > there is no documentation and there is no power/PAPR spec for numa in
> > public domain and there are no single dt file in arch/powerpc which
> > describes the numa. if we get any one of these details, we can align
> > to powerpc implementation.
> 
> Basically the idea is to have an "ibm,associativity" property in each
> bus or device that is node specific, and this includes all CPUs and
> memory nodes. ...

I should have mentioned that the example I gave was still rather basic.
In a larger real-world system, you have more levels of associativity,
though not all of them are relevant for NUMA memory allocation.
Also, when you have levels that are not just a crossbar but instead
have multiple point-to-point connections or a ring bus, it gets more
complex, but you can still represent it with these properties.

For task placement, the associativity would also represent the
topology within one node (SMT threads, cores, clusters, chips,
mcms, sockets) as separate levels, and in large installations you
would have multiple levels of memory topology (memory controllers,
sockets, board/blade, chassis, rack, ...), which can get taken into
account for memory allocation to find the closest node. The metric
that you use here is how many levels within the topology are matching
between two devices (typically memory and i/o device, or memory and cpu).

	Arnd
--
To unsubscribe from this list: send the line "unsubscribe devicetree" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Shannon Zhao Nov. 26, 2014, 2:29 a.m. UTC | #8
On 2014/11/25 19:02, Arnd Bergmann wrote:
> On Tuesday 25 November 2014 17:42:44 Hanjun Guo wrote:
>> On 2014-11-25 11:55, Shannon Zhao wrote:
>>> Hi,
>>>
>>> On 2014/11/22 5:23, Ganapatrao Kulkarni wrote:
>> [...]
>>>> +==============================================================================
>>>> +4 - Example dts
>>>> +==============================================================================
>>>> +
>>>> +Example 1: 2 Node system each having 8 CPUs and a Memory.
>>>> +
>>>> +    numa-map {
>>>> +            #address-cells = <2>;
>>>> +            #size-cells = <1>;
>>>> +            #node-count = <2>;
>>>> +            mem-map =  <0x0 0x00000000 0>,
>>>> +                       <0x100 0x00000000 1>;
>>>> +
>>>> +            cpu-map = <0 7 0>,
>>>> +                      <8 15 1>;
>>>
>>> The cpu range is continuous here. But if there is a situation like below:
>>>
>>> 0 2 4 6 belong to node 0
>>> 1 3 5 7 belong to node 1
>>>
>>> This case is very common on X86. I don't know the real situation of arm as
>>> I don't have a hardware with 2 nodes.
>>>
>>> How can we generate a DTS about this situation? like below? Can be parsed?
>>>
>>>               cpu-map = <0 2 4 6 0>,
>>>                         <1 3 5 7 1>;
>>
>> I think the binding proposed here can not cover your needs, and I think this
>> binding is not suitable, there are some reasons.
>>
>>  - CPU logical ID is allocated by OS, and it depends on the order of CPU node
>>    in the device tree, so it may be in a clean order like this patch proposed,
>>    or it will like the order Shannon pointed out.
>>
>>  - Since CPU logical ID is allocated by OS, DTS file will not know these
>>    numbers.
> 
> Also:
> 
> - you cannot support hierarchical NUMA topology
> 
> - you cannot have CPU-less or memory-less nodes
> 
> - you cannot associate I/O devices with NUMA nodes, only memory and CPU
> 
>> So the problem behind this is the mappings between CPUs and NUMA nodes,
>> there is already mapping for CPU hardware ID (MPIDR) and CPU logical ID,
>> and MPIDR will be not changed, why not using MPIDR for the mapping of
>> NUMA node and CPU? then the mappings will be:
>>
>> CPU logical ID <------> CPU MPIDR <-----> NUMA node ID <-----> proximity domain
>> (allocated by OS)      (constant)       (allocated by OS)
> 
> No, don't hardcode ARM specifics into a common binding either. I've looked
> at the ibm,associativity properties again, and I think we should just use
> those, they can cover all cases and are completely independent of the
> architecture. We should probably discuss about the property name though,
> as using the "ibm," prefix might not be the best idea.
> 

Yeah, I have read the relevant codes in qemu. I think the "ibm,associativity" is more scalable:-)

About the prefix, my opinion is that as this is relevant with NUMA, maybe we can use "numa" as the prefix.

Thanks,
Shannon

--
To unsubscribe from this list: send the line "unsubscribe devicetree" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Hanjun Guo Nov. 26, 2014, 9:12 a.m. UTC | #9
On 2014-11-26 3:00, Arnd Bergmann wrote:
> On Tuesday 25 November 2014 08:15:47 Ganapatrao Kulkarni wrote:
>>> No, don't hardcode ARM specifics into a common binding either. I've looked
>>> at the ibm,associativity properties again, and I think we should just use
>>> those, they can cover all cases and are completely independent of the
>>> architecture. We should probably discuss about the property name though,
>>> as using the "ibm," prefix might not be the best idea.
>>
>> We have started with new proposal, since not got enough details how
>> ibm/ppc is managing the numa using dt.
>> there is no documentation and there is no power/PAPR spec for numa in
>> public domain and there are no single dt file in arch/powerpc which
>> describes the numa. if we get any one of these details, we can align
>> to powerpc implementation.
> 
> Basically the idea is to have an "ibm,associativity" property in each
> bus or device that is node specific, and this includes all CPUs and
> memory nodes. The property contains an array of 32-bit integers that
> count the resources. Take an example of a NUMA cluster of two machines
> with four sockets and four cores each (32 cores total), a memory
> channel on each socket and one PCI host per board that is connected
> at equal speed to each socket on the board.
> 
> The ibm,associativity property in each PCI host, CPU or memory device
> node consequently has an array of three (board, socket, core) integers:
> 
> 	memory@0,0 {
> 		device_type = "memory";
> 		reg = <0x0 0x0  0x4 0x0;
> 		/* board 0, socket 0, no specific core */
> 		ibm,asssociativity = <0 0 0xffff>;
> 	};
> 
> 	memory@4,0 {
> 		device_type = "memory";
> 		reg = <0x4 0x0  0x4 0x0>;
> 		/* board 0, socket 1, no specific core */
> 		ibm,asssociativity = <0 1 0xffff>; 
> 	};
> 
> 	...
> 
> 	memory@1c,0 {
> 		device_type = "memory";
> 		reg = <0x1c 0x0  0x4 0x0>;
> 		/* board 0, socket 7, no specific core */
> 		ibm,asssociativity = <1 7 0xffff>; 
> 	};
> 
> 	cpus {
> 		#address-cells = <2>;
> 		#size-cells = <0>;
> 		cpu@0 {
> 			device_type = "cpu";
> 			reg = <0 0>;
> 			/* board 0, socket 0, core 0*/
> 			ibm,asssociativity = <0 0 0>; 
> 		};
> 
> 		cpu@1 {
> 			device_type = "cpu";
> 			reg = <0 0>;
> 			/* board 0, socket 0, core 0*/
> 			ibm,asssociativity = <0 0 0>; 
> 		};
> 
> 		...
> 
> 		cpu@31 {
> 			device_type = "cpu";
> 			reg = <0 32>;
> 			/* board 1, socket 7, core 31*/
> 			ibm,asssociativity = <1 7 31>; 
> 		};
> 	};
> 
> 	pci@100,0 {
> 		device_type = "pci";
> 		/* board 0 */
> 		ibm,associativity = <0 0xffff 0xffff>;
> 		...
> 	};
> 
> 	pci@200,0 {
> 		device_type = "pci";
> 		/* board 1 */
> 		ibm,associativity = <1 0xffff 0xffff>;
> 		...
> 	};
> 
> 	ibm,associativity-reference-points = <0 1>;
> 
> The "ibm,associativity-reference-points" property here indicates that index 2
> of each array is the most important NUMA boundary for the particular system,
> because the performance impact of allocating memory on the remote board 
> is more significant than the impact of using memory on a remote socket of the
> same board. Linux will consequently use the first field in the array as
> the NUMA node ID. If the link between the boards however is relatively fast,
> so you care mostly about allocating memory on the same socket, but going to
> another board isn't much worse than going to another socket on the same
> board, this would be
> 
> 	ibm,associativity-reference-points = <1 0>;
> 
> so Linux would ignore the board ID and use the socket ID as the NUMA node
> number. The same would apply if you have only one (otherwise identical
> board, then you would get
> 
> 	ibm,associativity-reference-points = <1>;
> 
> which means that index 0 is completely irrelevant for NUMA considerations
> and you just care about the socket ID. In this case, devices on the PCI
> bus would also not care about NUMA policy and just allocate buffers from
> anywhere, while in original example Linux would allocate DMA buffers only
> from the local board.

Thanks for the detail information. I have the concerns about the distance
for NUMA nodes, does the "ibm,associativity-reference-points" property can
represent the distance between NUMA nodes?

For example, a system with 4 sockets connected like below:

Socket 0  <---->  Socket 1  <---->  Socket 2  <---->  Socket 3

So from socket 0 to socket 1 (maybe on the same board), it just need 1
jump to access the memory, but from socket 0 to socket 2/3, it needs
2/3 jumps and the *distance* relative longer. Can
"ibm,associativity-reference-points" property cover this?

Thanks
Hanjun

--
To unsubscribe from this list: send the line "unsubscribe devicetree" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Arnd Bergmann Nov. 26, 2014, 4:51 p.m. UTC | #10
On Wednesday 26 November 2014 10:29:01 Shannon Zhao wrote:
> On 2014/11/25 19:02, Arnd Bergmann wrote:
> > No, don't hardcode ARM specifics into a common binding either. I've looked
> > at the ibm,associativity properties again, and I think we should just use
> > those, they can cover all cases and are completely independent of the
> > architecture. We should probably discuss about the property name though,
> > as using the "ibm," prefix might not be the best idea.
> > 
> 
> Yeah, I have read the relevant codes in qemu. I think the "ibm,associativity" is more scalable:-)

Ok

> About the prefix, my opinion is that as this is relevant with NUMA,
> maybe we can use "numa" as the prefix.

A prefix should really be the name of a company or institution, so it could
be "arm" or "linux", but not "numa". Would could use "numa-associativity"
with a dash instead of a comma, but that would still be somewhat imprecise
because the associativity property is about system topology inside of
a NUMA domain as well, such as cores, core clusters or SMT threads that
only share caches but not physical memory addresses.

	Arnd
--
To unsubscribe from this list: send the line "unsubscribe devicetree" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Ganapatrao Kulkarni Nov. 30, 2014, 4:38 p.m. UTC | #11
Hi Arnd,


On Tue, Nov 25, 2014 at 11:00 AM, Arnd Bergmann <arnd@arndb.de> wrote:
> On Tuesday 25 November 2014 08:15:47 Ganapatrao Kulkarni wrote:
>> > No, don't hardcode ARM specifics into a common binding either. I've looked
>> > at the ibm,associativity properties again, and I think we should just use
>> > those, they can cover all cases and are completely independent of the
>> > architecture. We should probably discuss about the property name though,
>> > as using the "ibm," prefix might not be the best idea.
>>
>> We have started with new proposal, since not got enough details how
>> ibm/ppc is managing the numa using dt.
>> there is no documentation and there is no power/PAPR spec for numa in
>> public domain and there are no single dt file in arch/powerpc which
>> describes the numa. if we get any one of these details, we can align
>> to powerpc implementation.
>
> Basically the idea is to have an "ibm,associativity" property in each
> bus or device that is node specific, and this includes all CPUs and
> memory nodes. The property contains an array of 32-bit integers that
> count the resources. Take an example of a NUMA cluster of two machines
> with four sockets and four cores each (32 cores total), a memory
> channel on each socket and one PCI host per board that is connected
> at equal speed to each socket on the board.
thanks for the detailed information.
IMHO, linux-numa code does not care about how the hardware design is,
like how many boards and how many sockets it has. It only needs to
know how many numa nodes system has, how resources are mapped to nodes
and node-distance to define inter node memory access latency. i think
it will be simple, if we merge board and socket to single entry say
node.
also we are assuming here that numa h/w design will have multiple
boards and sockets, what if it has something different/more.

>
> The ibm,associativity property in each PCI host, CPU or memory device
> node consequently has an array of three (board, socket, core) integers:
>
>         memory@0,0 {
>                 device_type = "memory";
>                 reg = <0x0 0x0  0x4 0x0;
>                 /* board 0, socket 0, no specific core */
>                 ibm,asssociativity = <0 0 0xffff>;
>         };
>
>         memory@4,0 {
>                 device_type = "memory";
>                 reg = <0x4 0x0  0x4 0x0>;
>                 /* board 0, socket 1, no specific core */
>                 ibm,asssociativity = <0 1 0xffff>;
>         };
>
>         ...
>
>         memory@1c,0 {
>                 device_type = "memory";
>                 reg = <0x1c 0x0  0x4 0x0>;
>                 /* board 0, socket 7, no specific core */
>                 ibm,asssociativity = <1 7 0xffff>;
>         };
>
>         cpus {
>                 #address-cells = <2>;
>                 #size-cells = <0>;
>                 cpu@0 {
>                         device_type = "cpu";
>                         reg = <0 0>;
>                         /* board 0, socket 0, core 0*/
>                         ibm,asssociativity = <0 0 0>;
>                 };
>
>                 cpu@1 {
>                         device_type = "cpu";
>                         reg = <0 0>;
>                         /* board 0, socket 0, core 0*/
>                         ibm,asssociativity = <0 0 0>;
>                 };
>
>                 ...
>
>                 cpu@31 {
>                         device_type = "cpu";
>                         reg = <0 32>;
>                         /* board 1, socket 7, core 31*/
>                         ibm,asssociativity = <1 7 31>;
>                 };
>         };
>
>         pci@100,0 {
>                 device_type = "pci";
>                 /* board 0 */
>                 ibm,associativity = <0 0xffff 0xffff>;
>                 ...
>         };
>
>         pci@200,0 {
>                 device_type = "pci";
>                 /* board 1 */
>                 ibm,associativity = <1 0xffff 0xffff>;
>                 ...
>         };
>
>         ibm,associativity-reference-points = <0 1>;
>
> The "ibm,associativity-reference-points" property here indicates that index 2
> of each array is the most important NUMA boundary for the particular system,
> because the performance impact of allocating memory on the remote board
> is more significant than the impact of using memory on a remote socket of the
> same board. Linux will consequently use the first field in the array as
> the NUMA node ID. If the link between the boards however is relatively fast,
> so you care mostly about allocating memory on the same socket, but going to
> another board isn't much worse than going to another socket on the same
> board, this would be
>
>         ibm,associativity-reference-points = <1 0>;
i am not able to understand fully, it will be grate help, if you
explain, how we capture the node distance matrix using
"ibm,associativity-reference-points "
for example, how DT looks like for the system with 4 nodes, with below
inter-node distance matrix.
node 0 1 distance 20
node 0 2 distance 20
node 0 3 distance 20
node 1 2 distance 20
node 1 3 distance 20
node 2 3 distance 20
>
> so Linux would ignore the board ID and use the socket ID as the NUMA node
> number. The same would apply if you have only one (otherwise identical
> board, then you would get
>
>         ibm,associativity-reference-points = <1>;
>
> which means that index 0 is completely irrelevant for NUMA considerations
> and you just care about the socket ID. In this case, devices on the PCI
> bus would also not care about NUMA policy and just allocate buffers from
> anywhere, while in original example Linux would allocate DMA buffers only
> from the local board.
>
>         Arnd
thanks
ganapat
ps: sorry for the delayed reply.
--
To unsubscribe from this list: send the line "unsubscribe devicetree" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Arnd Bergmann Nov. 30, 2014, 5:13 p.m. UTC | #12
On Sunday 30 November 2014 08:38:02 Ganapatrao Kulkarni wrote:

> On Tue, Nov 25, 2014 at 11:00 AM, Arnd Bergmann <arnd@arndb.de> wrote:
> > On Tuesday 25 November 2014 08:15:47 Ganapatrao Kulkarni wrote:
> >> > No, don't hardcode ARM specifics into a common binding either. I've looked
> >> > at the ibm,associativity properties again, and I think we should just use
> >> > those, they can cover all cases and are completely independent of the
> >> > architecture. We should probably discuss about the property name though,
> >> > as using the "ibm," prefix might not be the best idea.
> >>
> >> We have started with new proposal, since not got enough details how
> >> ibm/ppc is managing the numa using dt.
> >> there is no documentation and there is no power/PAPR spec for numa in
> >> public domain and there are no single dt file in arch/powerpc which
> >> describes the numa. if we get any one of these details, we can align
> >> to powerpc implementation.
> >
> > Basically the idea is to have an "ibm,associativity" property in each
> > bus or device that is node specific, and this includes all CPUs and
> > memory nodes. The property contains an array of 32-bit integers that
> > count the resources. Take an example of a NUMA cluster of two machines
> > with four sockets and four cores each (32 cores total), a memory
> > channel on each socket and one PCI host per board that is connected
> > at equal speed to each socket on the board.
> thanks for the detailed information.
> IMHO, linux-numa code does not care about how the hardware design is,
> like how many boards and how many sockets it has. It only needs to
> know how many numa nodes system has, how resources are mapped to nodes
> and node-distance to define inter node memory access latency. i think
> it will be simple, if we merge board and socket to single entry say
> node.

But it's not good to rely on implementation details of a particular
operating system.

> also we are assuming here that numa h/w design will have multiple
> boards and sockets, what if it has something different/more.

As I said, this was a simplified example, you can have an arbitrary
number of levels, and normally there are more than three, to capture
the cache hierarchy and other things as well.

> > The "ibm,associativity-reference-points" property here indicates that index 2
> > of each array is the most important NUMA boundary for the particular system,
> > because the performance impact of allocating memory on the remote board
> > is more significant than the impact of using memory on a remote socket of the
> > same board. Linux will consequently use the first field in the array as
> > the NUMA node ID. If the link between the boards however is relatively fast,
> > so you care mostly about allocating memory on the same socket, but going to
> > another board isn't much worse than going to another socket on the same
> > board, this would be
> >
> >         ibm,associativity-reference-points = <1 0>;
> i am not able to understand fully, it will be grate help, if you
> explain, how we capture the node distance matrix using
> "ibm,associativity-reference-points "
> for example, how DT looks like for the system with 4 nodes, with below
> inter-node distance matrix.
> node 0 1 distance 20
> node 0 2 distance 20
> node 0 3 distance 20
> node 1 2 distance 20
> node 1 3 distance 20
> node 2 3 distance 20

In your example, you have only one entry in
ibm,associativity-reference-points as it's even simpler: just
one level of hierarchy, everything is the same distance from
everything else, so within the associativity hierarchy, the
ibm,associativity-reference-points just points to the one
level that indicates a NUMA node.

You would only need multiple entries here if the hierarchy is
complex enough to require multiple levels of topology.

	Arnd
--
To unsubscribe from this list: send the line "unsubscribe devicetree" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Arnd Bergmann Dec. 10, 2014, 10:57 a.m. UTC | #13
On Wednesday 26 November 2014 17:12:49 Hanjun Guo wrote:
> 
> Thanks for the detail information. I have the concerns about the distance
> for NUMA nodes, does the "ibm,associativity-reference-points" property can
> represent the distance between NUMA nodes?
> 
> For example, a system with 4 sockets connected like below:
> 
> Socket 0  <---->  Socket 1  <---->  Socket 2  <---->  Socket 3
> 
> So from socket 0 to socket 1 (maybe on the same board), it just need 1
> jump to access the memory, but from socket 0 to socket 2/3, it needs
> 2/3 jumps and the *distance* relative longer. Can
> "ibm,associativity-reference-points" property cover this?
> 

Hi Hanjun,

I only today found your replies in my spam folder, I need to put you on
a whitelist so that doesn't happen again.

The above topology is not easy to represent, but I think it would work
like this (ignoring the threads/cores/clusters on the socket, which
would also need to be described in a full DT), using multiple logical
paths between the nodes:

socket 0
ibm,associativity = <0 0 0 0>, <1 1 1 0>, <2 2 0 0>, <3 0 0 0>;

socket 1
ibm,associativity = <1 1 1 1>, <0 0 0 1>, <2 2 2 1>, <3 3 1 1>;

socket 2
ibm,associativity = <2 2 2 2>, <0 0 2 2>, <1 1 1 2>, <3 3 3 2>;

socket 3
ibm,associativity = <3 3 3 3>, <0 3 3 3>, <1 1 3 3>, <2 2 2 3>;

This describes four levels or hierarchy, with the lowest level
being a single CPU core on one socket, and four paths between
the sockets. To compute the associativity between two sockets,
you need to look at each combination of paths to find the best
match.

Comparing sockets 0 and 1, the best matches are <1 1 1 0>
with <1 1 1 1>, and <0 0 0 0> with <0 0 0 1>. In each case, the
associativity is "3", meaning the first three entries match.

Comparing sockets 0 and 3, we have four equally bad matches
that each only match in the highest-level domain, e.g. <0 0 0 0>
with <0 3 3 3>, so the associativity is only "1", and that means
the two nodes are less closely associated than two neighboring
ones.

With the algorithm that powerpc uses to turn associativity into
distance, 2**(numlevels - associativity), this would put the
distance of neighboring nodes at "2", and the longest distance
at "8".

	Arnd
--
To unsubscribe from this list: send the line "unsubscribe devicetree" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Hanjun Guo Dec. 11, 2014, 9:16 a.m. UTC | #14
Hi Arnd,

On 2014年12月10日 18:57, Arnd Bergmann wrote:
> On Wednesday 26 November 2014 17:12:49 Hanjun Guo wrote:
>>
>> Thanks for the detail information. I have the concerns about the distance
>> for NUMA nodes, does the "ibm,associativity-reference-points" property can
>> represent the distance between NUMA nodes?
>>
>> For example, a system with 4 sockets connected like below:
>>
>> Socket 0  <---->  Socket 1  <---->  Socket 2  <---->  Socket 3
>>
>> So from socket 0 to socket 1 (maybe on the same board), it just need 1
>> jump to access the memory, but from socket 0 to socket 2/3, it needs
>> 2/3 jumps and the *distance* relative longer. Can
>> "ibm,associativity-reference-points" property cover this?
>>
>
> Hi Hanjun,
>
> I only today found your replies in my spam folder, I need to put you on
> a whitelist so that doesn't happen again.

Thanks. I hope my ACPI patches will not scare your email filter :)

>
> The above topology is not easy to represent, but I think it would work
> like this (ignoring the threads/cores/clusters on the socket, which
> would also need to be described in a full DT), using multiple logical
> paths between the nodes:
>
> socket 0
> ibm,associativity = <0 0 0 0>, <1 1 1 0>, <2 2 0 0>, <3 0 0 0>;
>
> socket 1
> ibm,associativity = <1 1 1 1>, <0 0 0 1>, <2 2 2 1>, <3 3 1 1>;
>
> socket 2
> ibm,associativity = <2 2 2 2>, <0 0 2 2>, <1 1 1 2>, <3 3 3 2>;
>
> socket 3
> ibm,associativity = <3 3 3 3>, <0 3 3 3>, <1 1 3 3>, <2 2 2 3>;
>
> This describes four levels or hierarchy, with the lowest level
> being a single CPU core on one socket, and four paths between
> the sockets. To compute the associativity between two sockets,
> you need to look at each combination of paths to find the best
> match.
>
> Comparing sockets 0 and 1, the best matches are <1 1 1 0>
> with <1 1 1 1>, and <0 0 0 0> with <0 0 0 1>. In each case, the
> associativity is "3", meaning the first three entries match.
>
> Comparing sockets 0 and 3, we have four equally bad matches
> that each only match in the highest-level domain, e.g. <0 0 0 0>
> with <0 3 3 3>, so the associativity is only "1", and that means
> the two nodes are less closely associated than two neighboring
> ones.
>
> With the algorithm that powerpc uses to turn associativity into
> distance, 2**(numlevels - associativity), this would put the
> distance of neighboring nodes at "2", and the longest distance
> at "8".

Thanks for the explain, I can understand how it works now,
a bit complicated for me and I think the distance property
"node-matrix" in Ganapatrao's patch is straight forward,
what do you think?

Thanks
Hanjun
--
To unsubscribe from this list: send the line "unsubscribe devicetree" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Arnd Bergmann Dec. 12, 2014, 2:20 p.m. UTC | #15
On Thursday 11 December 2014 17:16:35 Hanjun Guo wrote:
> On 2014年12月10日 18:57, Arnd Bergmann wrote:
> > On Wednesday 26 November 2014 17:12:49 Hanjun Guo wrote:
> > The above topology is not easy to represent, but I think it would work
> > like this (ignoring the threads/cores/clusters on the socket, which
> > would also need to be described in a full DT), using multiple logical
> > paths between the nodes:
> >
> > socket 0
> > ibm,associativity = <0 0 0 0>, <1 1 1 0>, <2 2 0 0>,  0 0 0>;
> >
> > socket 1
> > ibm,associativity = <1 1 1 1>, <0 0 0 1>, <2 2 2 1>,  3 1 1>;
> >
> > socket 2
> > ibm,associativity = <2 2 2 2>, <0 0 2 2>, <1 1 1 2>,  3 3 2>;
> >
> > socket 3
> > ibm,associativity =  3 3 3>, <0 3 3 3>, <1 1 3 3>, <2 2 2 3>;
> >
> > This describes four levels or hierarchy, with the lowest level
> > being a single CPU core on one socket, and four paths between
> > the sockets. To compute the associativity between two sockets,
> > you need to look at each combination of paths to find the best
> > match.
> >
> > Comparing sockets 0 and 1, the best matches are <1 1 1 0>
> > with <1 1 1 1>, and <0 0 0 0> with <0 0 0 1>. In each case, the
> > associativity is "3", meaning the first three entries match.
> >
> > Comparing sockets 0 and 3, we have four equally bad matches
> > that each only match in the highest-level domain, e.g. <0 0 0 0>
> > with <0 3 3 3>, so the associativity is only "1", and that means
> > the two nodes are less closely associated than two neighboring
> > ones.
> >
> > With the algorithm that powerpc uses to turn associativity into
> > distance, 2**(numlevels - associativity), this would put the
> > distance of neighboring nodes at "2", and the longest distance
> > at "8".
> 
> Thanks for the explain, I can understand how it works now,
> a bit complicated for me and I think the distance property
> "node-matrix" in Ganapatrao's patch is straight forward,
> what do you think?

I still think we should go the whole way of having something compatible
with the existing bindings, possibly using different property names
if there are objections to using the "ibm," prefix.

The associativity property is more expressive and lets you describe
things that you can't describe with the mem-map/cpu-map properties,
e.g. devices that are part of the NUMA hierarchy but not associated
to exactly one last-level node.

	Arnd
--
To unsubscribe from this list: send the line "unsubscribe devicetree" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Hanjun Guo Dec. 15, 2014, 3:50 a.m. UTC | #16
On 2014年12月12日 22:20, Arnd Bergmann wrote:
> On Thursday 11 December 2014 17:16:35 Hanjun Guo wrote:
>> On 2014年12月10日 18:57, Arnd Bergmann wrote:
>>> On Wednesday 26 November 2014 17:12:49 Hanjun Guo wrote:
>>> The above topology is not easy to represent, but I think it would work
>>> like this (ignoring the threads/cores/clusters on the socket, which
>>> would also need to be described in a full DT), using multiple logical
>>> paths between the nodes:
>>>
>>> socket 0
>>> ibm,associativity = <0 0 0 0>, <1 1 1 0>, <2 2 0 0>,  0 0 0>;
>>>
>>> socket 1
>>> ibm,associativity = <1 1 1 1>, <0 0 0 1>, <2 2 2 1>,  3 1 1>;
>>>
>>> socket 2
>>> ibm,associativity = <2 2 2 2>, <0 0 2 2>, <1 1 1 2>,  3 3 2>;
>>>
>>> socket 3
>>> ibm,associativity =  3 3 3>, <0 3 3 3>, <1 1 3 3>, <2 2 2 3>;
>>>
>>> This describes four levels or hierarchy, with the lowest level
>>> being a single CPU core on one socket, and four paths between
>>> the sockets. To compute the associativity between two sockets,
>>> you need to look at each combination of paths to find the best
>>> match.
>>>
>>> Comparing sockets 0 and 1, the best matches are <1 1 1 0>
>>> with <1 1 1 1>, and <0 0 0 0> with <0 0 0 1>. In each case, the
>>> associativity is "3", meaning the first three entries match.
>>>
>>> Comparing sockets 0 and 3, we have four equally bad matches
>>> that each only match in the highest-level domain, e.g. <0 0 0 0>
>>> with <0 3 3 3>, so the associativity is only "1", and that means
>>> the two nodes are less closely associated than two neighboring
>>> ones.
>>>
>>> With the algorithm that powerpc uses to turn associativity into
>>> distance, 2**(numlevels - associativity), this would put the
>>> distance of neighboring nodes at "2", and the longest distance
>>> at "8".
>>
>> Thanks for the explain, I can understand how it works now,
>> a bit complicated for me and I think the distance property
>> "node-matrix" in Ganapatrao's patch is straight forward,
>> what do you think?
>
> I still think we should go the whole way of having something compatible
> with the existing bindings, possibly using different property names
> if there are objections to using the "ibm," prefix.

I agree that we should keep using existing bindings and not introducing
a new one.

Thanks
Hanjun
--
To unsubscribe from this list: send the line "unsubscribe devicetree" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/Documentation/devicetree/bindings/arm/numa.txt b/Documentation/devicetree/bindings/arm/numa.txt
new file mode 100644
index 0000000..ec6bf2d
--- /dev/null
+++ b/Documentation/devicetree/bindings/arm/numa.txt
@@ -0,0 +1,103 @@ 
+==============================================================================
+NUMA binding description.
+==============================================================================
+
+==============================================================================
+1 - Introduction
+==============================================================================
+
+Systems employing a Non Uniform Memory Access (NUMA) architecture contain
+collections of hardware resources including processors, memory, and I/O buses,
+that comprise what is commonly known as a “NUMA node”. Processor
+accesses to memory within the local NUMA node is
+generally faster than processor accesses to memory outside of the local
+NUMA node. DT defines interfaces that allow the platform to convey NUMA node
+topology information to OS.
+
+==============================================================================
+2 - numa-map node
+==============================================================================
+
+DT Binding for NUMA can be defined for memory and CPUs to map them to
+respective NUMA nodes.
+
+The DT binding can defined using numa-map node.
+The numa-map will have following properties to define NUMA topology.
+
+- mem-map:	This property defines the association between a range of
+		memory and the proximity domain/numa node to which it belongs.
+
+note: memory range address is passed using either memory node of
+DT or UEFI system table and should match to the address defined in mem-map.
+
+- cpu-map:	This property defines the association of range of processors
+		(range of cpu ids) and the proximity domain to which
+		the processor belongs.
+
+- node-matrix:	This table provides a matrix that describes the relative
+		distance (memory latency) between all System Localities.
+		The value of each Entry[i j distance] in node-matrix table,
+		where i represents a row of a matrix and j represents a
+		column of a matrix, indicates the relative distances
+		from Proximity Domain/Numa node i to every other
+		node j in the system (including itself).
+
+The numa-map node must contain the appropriate #address-cells,
+#size-cells and #node-count properties.
+
+
+==============================================================================
+4 - Example dts
+==============================================================================
+
+Example 1: 2 Node system each having 8 CPUs and a Memory.
+
+	numa-map {
+		#address-cells = <2>;
+		#size-cells = <1>;
+		#node-count = <2>;
+		mem-map =  <0x0 0x00000000 0>,
+		           <0x100 0x00000000 1>;
+
+		cpu-map = <0 7 0>,
+			  <8 15 1>;
+
+		node-matrix = <0 0 10>,
+			      <0 1 20>,
+			      <1 0 20>,
+			      <1 1 10>;
+	};
+
+Example 2: 4 Node system each having 4 CPUs and a Memory.
+
+	numa-map {
+		#address-cells = <2>;
+		#size-cells = <1>;
+		#node-count = <2>;
+		mem-map =  <0x0 0x00000000 0>,
+		           <0x100 0x00000000 1>,
+		           <0x200 0x00000000 2>,
+		           <0x300 0x00000000 3>;
+
+		cpu-map = <0 7 0>,
+			  <8 15 1>,
+			  <16 23 2>,
+			  <24 31 3>;
+
+		node-matrix = <0 0 10>,
+			      <0 1 20>,
+			      <0 2 20>,
+			      <0 3 20>,
+			      <1 0 20>,
+			      <1 1 10>,
+			      <1 2 20>,
+			      <1 3 20>,
+			      <2 0 20>,
+			      <2 1 20>,
+			      <2 2 10>,
+			      <2 3 20>,
+			      <3 0 20>,
+			      <3 1 20>,
+			      <3 2 20>,
+			      <3 3 10>;
+	};